Crawl4AI 🕷️🤖

Crawl4AI is a powerful, free web crawling service designed to extract useful information from web pages and make it accessible for large language models (LLMs) and AI applications. 🆓🌐

Features ✨

🕷️ Efficient web crawling to extract valuable data from websites
🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
🌍 Supports crawling multiple URLs simultaneously
🌃 Replaces media tags with their ALT text
🆓 Completely free to use and open-source

Getting Started 🚀

To get started with Crawl4AI, simply visit our web application at https://crawl4ai.your-domain.io and enter the URL(s) you want to crawl. The application will process the URLs and provide you with the extracted data in various formats.

Installation 💻

To install and run Crawl4AI locally or on your own server, follow these steps:

Clone the repository:

git clone https://github.com/your-username/crawl4ai.git

Navigate to the project directory:

cd crawl4ai

Create a .env file in the root folder and set your Groq API token:

GROQ_API_TOKEN=your_groq_api_token

Build the Docker image:

docker build -t crawl4ai .

On a Mac with Apple silicon, Docker builds arm64 images by default; to build an amd64 image instead, use the following command:

docker build --platform linux/amd64 -t crawl4ai .

Run the Docker container:

docker run -p 8000:80 crawl4ai

Access the application at http://localhost:8000.

For more detailed instructions and advanced configuration options, please refer to the installation guide.

Usage with Python 🐍

Here's an example of how to use Crawl4AI with Python to crawl a webpage and retrieve the extracted data:

Make sure you have the requests library installed. You can install it using pip:

pip install requests

Use the following Python code to send a request to the Crawl4AI server and retrieve the crawled data:

import requests

url = "http://localhost:8000/crawl"  # Replace with the appropriate server URL
data = {
    "urls": [
        "https://example.com"
    ],
    "provider_model": "groq/llama3-70b-8192",
    "api_token": "your_api_token",  # Replace with your provider's API token
    "include_raw_html": True,   # Python booleans are capitalized
    "forced": False,
    "extract_blocks": True,
    "word_count_threshold": 5
}

response = requests.post(url, json=data)

if response.status_code == 200:
    result = response.json()["results"][0]
    print("Parsed JSON:")
    print(result["parsed_json"])
    print("\nCleaned HTML:")
    print(result["cleaned_html"])
    print("\nMarkdown:")
    print(result["markdown"])
else:
    print("Error:", response.status_code, response.text)

This code sends a POST request to the Crawl4AI server running on localhost, specifying the target URL (https://example.com) and the desired options (provider_model, api_token, include_raw_html, forced, extract_blocks, and word_count_threshold). The server processes the request and returns the crawled data in JSON format.

The response from the server includes the parsed JSON, cleaned HTML, and markdown representations of the crawled webpage. You can access and use this data in your Python application as needed.
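Once you have a result, you may want to write each representation to disk. The following is a minimal sketch, assuming the result dictionary carries the `parsed_json`, `cleaned_html`, and `markdown` keys shown above. The `save_result` helper and the output file names are illustrative and not part of Crawl4AI.

```python
import json
from pathlib import Path


def save_result(result: dict, out_dir: str) -> list:
    """Write each representation of a crawl result to its own file.

    `result` is assumed to look like one entry of response.json()["results"].
    Returns the list of paths that were written.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    # Map response keys to (hypothetical) output file names.
    for key, filename in (
        ("parsed_json", "parsed.json"),
        ("cleaned_html", "cleaned.html"),
        ("markdown", "page.md"),
    ):
        value = result.get(key)
        if value is None:
            continue  # Skip fields the server did not return.
        # parsed_json may arrive as a string or as a decoded structure.
        text = value if isinstance(value, str) else json.dumps(value, indent=2)
        path = target / filename
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Calling `save_result(result, "output")` after the request above would create an `output` directory with one file per representation that is present in the response.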

Make sure to replace "http://localhost:8000/crawl" with the appropriate server URL if your Crawl4AI server is running on a different host or port.
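One way to avoid hard-coding the server address is to read it from the environment. The sketch below assumes a hypothetical `CRAWL4AI_URL` variable (Crawl4AI itself does not define it) and reuses the `/crawl` path and the required payload fields from the example above. The `post` argument exists only so the function can be exercised without a live server; by default it falls back to `requests.post`.

```python
import os

DEFAULT_BASE_URL = "http://localhost:8000"


def crawl_endpoint(base_url=None):
    """Return the full /crawl URL for the given or configured server."""
    # CRAWL4AI_URL is a hypothetical variable name chosen for this sketch.
    base = base_url or os.getenv("CRAWL4AI_URL", DEFAULT_BASE_URL)
    return base.rstrip("/") + "/crawl"


def crawl(urls, api_token, base_url=None, post=None, timeout=60):
    """POST the URLs to the server and return the list under "results"."""
    if post is None:
        import requests  # Imported lazily so a stub can be injected instead.
        post = requests.post

    payload = {
        "urls": list(urls),
        "provider_model": "groq/llama3-70b-8192",
        "api_token": api_token,
    }
    response = post(crawl_endpoint(base_url), json=payload, timeout=timeout)
    response.raise_for_status()  # Surface HTTP errors before reading the body.
    return response.json()["results"]
```

With this in place, pointing the client at another host is a matter of setting the variable or passing `base_url="http://my-server:8000"` explicitly.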

Using Crawl4AI as a Python Library 📚

You can also use Crawl4AI as a Python library in your own projects. Here's an example of how to use the Crawl4AI library:

Install the required dependencies:

pip install -r requirements.txt

Import the necessary modules and initialize the WebCrawler:

from crawler.web_crawler import WebCrawler
from crawler.models import UrlModel
import os

crawler = WebCrawler(db_path='crawler_data.db')

Fetch a single page:

single_url = UrlModel(url='https://kidocode.com', forced=True)
result = crawler.fetch_page(
    single_url,
    provider="openai/gpt-3.5-turbo",
    api_token=os.getenv('OPENAI_API_KEY'),
    extract_blocks_flag=True,
    word_count_threshold=5  # Minimum word count for an HTML tag to count as a meaningful block
)
print(result.model_dump())

Fetch multiple pages:

urls = [
    UrlModel(url='http://example.com', forced=False),
    UrlModel(url='http://example.org', forced=False)
]
results = crawler.fetch_pages(
    urls,
    provider="openai/gpt-3.5-turbo",
    api_token=os.getenv('OPENAI_API_KEY'),
    extract_blocks_flag=True,
    word_count_threshold=5
)

for res in results:
    # model_dump_json() matches the Pydantic v2 style of model_dump() used above
    print(res.model_dump_json())

This code demonstrates how to use the Crawl4AI library to fetch a single page or multiple pages. The WebCrawler is initialized with the path to the database, and the fetch_page and fetch_pages methods are used to crawl the specified URLs.

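Make sure the environment variable that holds your provider's API token is set before running the library examples. The snippets above read OPENAI_API_KEY; if you use a Groq model instead, set GROQ_API_TOKEN with your Groq API token and pass that value as api_token.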

That's it! You can now integrate Crawl4AI into your Python projects and leverage its web crawling capabilities. 🎉

📖 Parameters

| Parameter | Description | Required | Default Value |
| --- | --- | --- | --- |
| `urls` | A list of URLs to crawl and extract data from. | Yes | - |
| `provider_model` | The provider and model to use for extracting relevant information (e.g., "groq/llama3-70b-8192"). | Yes | - |
| `api_token` | Your API token for the specified provider. | Yes | - |
| `include_raw_html` | Whether to include the raw HTML content in the response. | No | `false` |
| `forced` | Whether to force a fresh crawl even if the URL has been previously crawled. | No | `false` |
| `extract_blocks` | Whether to extract meaningful blocks of text from the HTML. | No | `false` |
| `word_count_threshold` | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | `5` |
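The parameters above can be checked on the client side before a request is sent. The following is a minimal sketch, assuming the defaults and the minimum threshold listed in the table. The `build_crawl_payload` helper is hypothetical and not part of Crawl4AI, and the server may apply its own validation.

```python
def build_crawl_payload(
    urls,
    provider_model,
    api_token,
    include_raw_html=False,
    forced=False,
    extract_blocks=False,
    word_count_threshold=5,
):
    """Assemble a request body using the documented defaults."""
    # Accept a single URL string as a convenience.
    if isinstance(urls, str):
        urls = [urls]
    urls = list(urls)

    if not urls:
        raise ValueError("urls must contain at least one URL")
    if not provider_model or not api_token:
        raise ValueError("provider_model and api_token are required")
    if word_count_threshold < 5:
        raise ValueError("word_count_threshold must be at least 5")

    return {
        "urls": urls,
        "provider_model": provider_model,
        "api_token": api_token,
        "include_raw_html": include_raw_html,
        "forced": forced,
        "extract_blocks": extract_blocks,
        "word_count_threshold": word_count_threshold,
    }
```

The returned dictionary can be passed directly as the `json` argument of `requests.post`.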

Contributing 🤝

We welcome contributions from the open-source community to help improve Crawl4AI and make it even more valuable for AI enthusiasts and developers. To contribute, please follow these steps:

Fork the repository.
Create a new branch for your feature or bug fix.
Make your changes and commit them with descriptive messages.
Push your changes to your forked repository.
Submit a pull request to the main repository.

For more information on contributing, please see our contribution guidelines.

License 📄

Crawl4AI is released under the MIT License.

Contact 📧

If you have any questions, suggestions, or feedback, please feel free to reach out to us:

GitHub: unclecode
Twitter: @unclecode
Discord: your-invite-link

Let's work together to make the web more accessible and useful for AI applications! 💪🌐🤖
