PDFDataExtractor is a toolkit for automatically extracting semantic information from PDF files of scientific articles. It uses a template-based architecture and can currently extract information from articles by the following publishers, with more templates under development:
- | Elsevier
- | Royal Society of Chemistry
- | Advanced Material Families (Wiley)
- | Angewandte
- | Chemistry – A European Journal
- | American Chemical Society
This guide provides a quick tour through PDFDataExtractor concepts and functionalities.
- | Extracts metadata from scientific PDFs, including title, author, abstract, journal name, journal year, journal volume, journal page number, DOI, keywords, figure captions, section titles, headings, page numbers and references
- | Chemistry-aware PDF information extraction
- | Outputs PDF articles as plain text or JSON
- | Extracts articles from seven mainstream chemistry and physics publishers with high precision
- | Automated publisher detection
- | Automated article download from references
- Web services for a more user-friendly experience
- Support for more publishers
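
To make the template-based architecture, automated publisher detection and JSON output mentioned above more concrete, here is a minimal, self-contained Python sketch. It is not PDFDataExtractor's actual API or implementation: the names `PUBLISHER_MARKERS`, `TEMPLATES`, `detect_publisher`, `extract` and `to_json`, the marker strings and the toy layout rules are all assumptions made for illustration, and the input is plain text rather than a parsed PDF.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Metadata:
    publisher: str
    title: str
    doi: Optional[str]


# Hypothetical marker strings; a real detector would rely on richer cues.
PUBLISHER_MARKERS: Dict[str, Tuple[str, ...]] = {
    "Elsevier": ("elsevier",),
    "Royal Society of Chemistry": ("royal society of chemistry",),
}

# Generic DOI shape: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def detect_publisher(text: str) -> Optional[str]:
    """Return the first publisher whose marker occurs in the text, else None."""
    lowered = text.lower()
    for publisher, markers in PUBLISHER_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return publisher
    return None


def title_from_first_line(lines: List[str]) -> str:
    return lines[0]


def title_from_second_line(lines: List[str]) -> str:
    return lines[1]


# One template per publisher; these layout rules are invented for the sketch.
TEMPLATES: Dict[str, Callable[[List[str]], str]] = {
    "Elsevier": title_from_first_line,
    "Royal Society of Chemistry": title_from_second_line,
}


def find_doi(text: str) -> Optional[str]:
    match = DOI_PATTERN.search(text)
    # Strip sentence punctuation that may trail the DOI in running text.
    return match.group(0).rstrip(".,;") if match else None


def extract(text: str) -> Metadata:
    """Detect the publisher, then apply that publisher's template."""
    publisher = detect_publisher(text)
    if publisher is None or publisher not in TEMPLATES:
        raise ValueError("no template available for this document")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return Metadata(publisher, TEMPLATES[publisher](lines), find_doi(text))


def to_json(record: Metadata) -> str:
    return json.dumps(asdict(record), indent=2)


if __name__ == "__main__":
    sample = (
        "Royal Society of Chemistry\n"
        "A toy article title\n"
        "DOI: 10.1234/example.5678.\n"
    )
    print(to_json(extract(sample)))
```

Keeping the publisher-specific rules in a lookup table means that, in this sketch, supporting another publisher only requires a new marker entry and a new template function; the detection and output code stay unchanged.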
PDFDataExtractor:
The paper is currently under review. This project was financially supported by the Science and Technology Facilities Council (STFC), the Royal Academy of Engineering (RCSRF1819\7\10), and BASF.