Feature request: extract plaintext / markdown #1

joepio · 2023-12-20T10:03:19Z

Hi there! Thanks for creating and sharing this :)

One quite common use case for PDF libraries is to get the text from a PDF. This is often used for things like indexing documents in a search engine. There is a project in Rust that does this called pdf-extract, but I'd love to see an alternative to it (for a couple of reasons).

I noticed rspdf has a way to extract XML text from a PDF. I was wondering whether it would also be possible to extract content as plaintext? Or even better: extract it as markdown!

Perhaps this is completely out of scope for the project. Maybe I could help out with this someday (have some plans in this regard) if you think it may be a good fit.

Cheers!

rockyzhengwu · 2023-12-20T12:51:52Z

Thanks, I'm glad you're interested in this project.

This project is primarily centered around extracting text and images and converting them to other formats; for now, that means plain text. Markdown support is a potential future addition.

However, there are numerous bugs to address, particularly related to fonts. Consequently, the timeline for completion is uncertain.

joepio mentioned this issue Dec 20, 2023

Atomizer + PDF extractor atomicdata-dev/atomic-server#591
