src
.envrc
.gitattributes
.gitignore
Cargo.lock
Cargo.toml
Manifest.toml
Project.toml
README.md
all.csv
analysis.jl
flake.lock
flake.nix

arXiv PDF Metadata analysis

A small effort to quantify how many papers on arXiv just don't bother with metadata.

Running

This project is a small Rust binary that prints some summary info and saves more comprehensive data to results.csv:

cargo run --release
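
To give a rough idea of what "has metadata" can mean, here is a minimal sketch that is not taken from this repo's source. It assumes a crude heuristic: scan a PDF's raw bytes for the `/Title` and `/Author` keys of the document information dictionary. The names `contains_bytes`, `MetaPresence` and `scan_pdf_bytes` are hypothetical. The heuristic will miss metadata stored in compressed object streams or XMP, and `/Title` can also appear in outline entries, so a real analysis would likely use a proper PDF parser.

```rust
/// Returns true if `needle` occurs anywhere in `haystack`.
/// An empty needle is treated as "not found" to avoid a zero-size window.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}

/// Hypothetical summary of which metadata keys appear in the raw bytes.
#[derive(Debug, Default, PartialEq)]
struct MetaPresence {
    title: bool,
    author: bool,
}

/// Crude heuristic (an assumption, not the repo's method): look for the
/// literal key names in the uncompressed bytes of the file.
fn scan_pdf_bytes(bytes: &[u8]) -> MetaPresence {
    MetaPresence {
        title: contains_bytes(bytes, b"/Title"),
        author: contains_bytes(bytes, b"/Author"),
    }
}
```

A caller could read a file with `std::fs::read(path)` and pass the resulting bytes to `scan_pdf_bytes`.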

You can then generate some plots with Julia, running:

julia --project=.
> include("analysis.jl")
> main()

Datasets

If you don't have the space, time or need to download extra datasets, you can use the all.csv file in the root of this repo.

The notebook does not currently download the datasets itself; instead, it expects them to be in a certain location.

# list available pdf groups
gsutil ls gs://arxiv-dataset/arxiv/arxiv/pdf

# make the data dir
mkdir -p arxiv-pdfs

# then, download one set
gsutil -m cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/<yearmonth> ./arxiv-pdfs
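
As a quick sanity check after a download, the following sketch counts the PDF files under a directory such as `./arxiv-pdfs`. It is only an illustration: `count_pdfs` is a hypothetical helper, not part of this project, and the exact directory layout the binary expects is not documented here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively counts files whose extension is `pdf` (case-insensitive)
/// under `dir`. Hypothetical helper, not part of this repo.
fn count_pdfs(dir: &Path) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into subdirectories such as a <yearmonth> folder.
            count += count_pdfs(&path)?;
        } else if path
            .extension()
            .map_or(false, |ext| ext.eq_ignore_ascii_case("pdf"))
        {
            count += 1;
        }
    }
    Ok(count)
}
```

For example, `count_pdfs(Path::new("./arxiv-pdfs"))` returns the number of PDFs found, or an I/O error if the directory is missing.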
