Feature request: extract plaintext / markdown #1

Open
joepio opened this issue Dec 20, 2023 · 1 comment

Comments

@joepio

joepio commented Dec 20, 2023

Hi there! Thanks for creating and sharing this :)

One quite common use case for PDF libraries is getting the text from a PDF. This is often used for things like indexing documents in a search engine. There is a Rust project called pdf-extract that does this, but I'd love to see an alternative to it (for a couple of reasons).

I noticed rspdf has a way to extract XML text from a PDF. I was wondering whether it would also be possible to extract content as plaintext? Or even better: extract it as markdown!

Perhaps this is completely out of scope for the project. Maybe I could help out with this someday (have some plans in this regard) if you think it may be a good fit.

Cheers!

@rockyzhengwu
Owner

Thanks, I'm glad you're interested in this project.

This project is primarily centered around extracting text and images and converting them to other formats, which for now means plain text. Markdown support is a potential future addition.

However, there are numerous bugs to address, particularly related to fonts. Consequently, the timeline for completion is uncertain.
