A Go library for processing various content types into markdown/plaintext.
Chew is a Go library that processes various content types into markdown or plaintext. It supports multiple content types, including HTML, PDF, CSV, JSON, YAML, DOCX, PPTX, Markdown, Plaintext, MP3, FLAC, and WAVE.
go get github.com/mmatongo/chew
Here's a basic example of how to use Chew:
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"

	"github.com/mmatongo/chew/v1"
)

func main() {
	// URLs to fetch and convert.
	urls := []string{
		"https://example.com",
	}

	config := chew.Config{
		UserAgent:       "Chew/1.0 (+https://github.com/mmatongo/chew)",
		RetryLimit:      3,
		RetryDelay:      5 * time.Second,
		CrawlDelay:      10 * time.Second,
		ProxyList:       []string{}, // Add your proxies here, or leave empty
		RateLimit:       2 * time.Second,
		RateBurst:       3,
		IgnoreRobotsTxt: false,
	}

	haChew := chew.New(config)

	// Optional: use a context with a timeout to cancel the operation after 30 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	chunks, err := haChew.Process(ctx, urls)
	if err != nil {
		// errors.Is also matches a deadline error that was wrapped along the way.
		if errors.Is(err, context.DeadlineExceeded) {
			log.Println("Operation timed out")
		} else {
			log.Printf("Error processing URLs: %v", err)
		}
		return
	}

	// Each chunk carries the source it came from and a piece of extracted content.
	for _, chunk := range chunks {
		fmt.Printf("Source: %s\nContent: %s\n\n", chunk.Source, chunk.Content)
	}
}
Output of the basic example:
Source: https://example.com
Content: Example Domain
Source: https://example.com
Content: This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
Source: https://example.com
Content: More information...
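As the output shows, a single source can yield several chunks. If you would rather have one block of text per source, the chunks can be joined after processing. The following is a minimal sketch under stated assumptions: Chunk is a local stand-in that has only the Source and Content fields printed in the example (Chew's own chunk type may carry more), and joinBySource is a hypothetical helper, not something the library provides.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a local stand-in with only the two fields the example prints.
// It is an assumption for illustration; Chew's real chunk type may differ.
type Chunk struct {
	Source  string
	Content string
}

// joinBySource concatenates chunk contents per source. It returns the sources
// in the order they first appear, plus a map from source to joined text.
// Chunks from the same source keep their original relative order.
func joinBySource(chunks []Chunk) (order []string, docs map[string]string) {
	parts := make(map[string][]string)
	for _, c := range chunks {
		if _, seen := parts[c.Source]; !seen {
			order = append(order, c.Source)
		}
		parts[c.Source] = append(parts[c.Source], c.Content)
	}

	docs = make(map[string]string, len(parts))
	for src, p := range parts {
		docs[src] = strings.Join(p, "\n\n")
	}
	return order, docs
}

func main() {
	// Sample data modeled on the output shown above.
	chunks := []Chunk{
		{Source: "https://example.com", Content: "Example Domain"},
		{Source: "https://example.com", Content: "This domain is for use in illustrative examples in documents."},
		{Source: "https://example.com", Content: "More information..."},
	}

	order, docs := joinBySource(chunks)
	for _, src := range order {
		fmt.Printf("Source: %s\n%s\n", src, docs[src])
	}
}
```

Returning the source order separately keeps the result deterministic, since Go map iteration order is not.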
You can find more examples in the examples directory as well as instructions on how to use Chew with Ruby and Python.
Contributions are welcome! Feel free to open an issue or submit a pull request if you have any suggestions or improvements.
This project is licensed under the MIT License - see the LICENSE file for details.
The logo was made by the amazing MariaLetta.
The roadmap for this project is available here. It's meant more as a guide than a strict plan because I only work on this project in my free time.