Crawl4AI is a powerful, free web crawling service designed to extract useful information from web pages and make it accessible for large language models (LLMs) and AI applications.
- Efficient web crawling to extract valuable data from websites
- LLM-friendly output formats (JSON, cleaned HTML, markdown)
- Supports crawling multiple URLs simultaneously
- Replaces media tags with their ALT text
- Completely free to use and open-source
To get started with Crawl4AI, simply visit our web application at https://crawl4ai.your-domain.io and enter the URL(s) you want to crawl. The application will process the URLs and provide you with the extracted data in various formats.
To install and run Crawl4AI locally or on your own server, follow these steps:
- Clone the repository:
git clone https://github.com/your-username/crawl4ai.git
- Navigate to the project directory:
cd crawl4ai
- Create a .env file in the root folder and set your Groq API token:
GROQ_API_TOKEN=your_groq_api_token
- Build the Docker image:
docker build -t crawl4ai .
For Mac users, use the following command instead:
docker build --platform linux/amd64 -t crawl4ai .
- Run the Docker container:
docker run -p 8000:80 crawl4ai
- Access the application at http://localhost:8000.
For more detailed instructions and advanced configuration options, please refer to the installation guide.
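After `docker run`, the container may need a moment before it accepts requests. The following is a minimal sketch of a readiness check that polls the address from the steps above using only the standard library. The helper name `wait_for_server` and its timeout defaults are assumptions for illustration, not part of Crawl4AI.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://localhost:8000", timeout=30.0, interval=1.0):
    """Poll base_url until it answers or `timeout` seconds elapse.

    Hypothetical helper: returns True as soon as the server sends any
    HTTP response, False if it never becomes reachable in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(base_url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status code.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_server()` before sending the first crawl request avoids a connection error while the container is still starting.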
Here's an example of how to use Crawl4AI with Python to crawl a webpage and retrieve the extracted data:
- Make sure you have the requests library installed. You can install it using pip:
pip install requests
- Use the following Python code to send a request to the Crawl4AI server and retrieve the crawled data:
import requests

url = "http://localhost:8000/crawl"  # Replace with the appropriate server URL

# Request options; see the parameter table for their meaning and defaults
data = {
    "urls": [
        "https://example.com"
    ],
    "provider_model": "groq/llama3-70b-8192",
    "api_token": "your_api_token",
    "include_raw_html": True,
    "forced": False,
    "extract_blocks": True,
    "word_count_threshold": 5
}

response = requests.post(url, json=data)

if response.status_code == 200:
    # Take the result for the first (and only) requested URL
    result = response.json()["results"][0]
    print("Parsed JSON:")
    print(result["parsed_json"])
    print("\nCleaned HTML:")
    print(result["cleaned_html"])
    print("\nMarkdown:")
    print(result["markdown"])
else:
    print("Error:", response.status_code, response.text)
This code sends a POST request to the Crawl4AI server running on localhost, specifying the target URL (https://example.com) and the desired options (provider_model, api_token, include_raw_html, forced, extract_blocks, and word_count_threshold). The server processes the request and returns the crawled data in JSON format.
The response from the server includes the parsed JSON, cleaned HTML, and markdown representations of the crawled webpage. You can access and use this data in your Python application as needed.
Make sure to replace "http://localhost:8000/crawl" with the appropriate server URL if your Crawl4AI server is running on a different host or port.
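The example above indexes the response keys directly, which raises a KeyError if a field is missing. A minimal, more defensive sketch follows, assuming only the response shape shown above (a top-level "results" list whose items carry parsed_json, cleaned_html, and markdown). The helper name `extract_outputs` is hypothetical and not part of Crawl4AI.

```python
def extract_outputs(response_body):
    """Collect the three documented outputs for each crawled URL.

    `response_body` is the decoded JSON of a /crawl response. Missing
    fields become None instead of raising KeyError.
    """
    outputs = []
    for item in response_body.get("results", []):
        outputs.append({
            "parsed_json": item.get("parsed_json"),
            "cleaned_html": item.get("cleaned_html"),
            "markdown": item.get("markdown"),
        })
    return outputs
```

With the request code above, this would be called as `extract_outputs(response.json())`, and the returned list can be looped over when several URLs were submitted.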
You can also use Crawl4AI as a Python library in your own projects. Here's an example of how to use the Crawl4AI library:
- Install the required dependencies:
pip install -r requirements.txt
- Import the necessary modules and initialize the WebCrawler:
from crawler.web_crawler import WebCrawler
from crawler.models import UrlModel
import os
crawler = WebCrawler(db_path='crawler_data.db')
- Fetch a single page:
single_url = UrlModel(url='https://kidocode.com', forced=True)
result = crawler.fetch_page(
    single_url,
    provider="openai/gpt-3.5-turbo",
    api_token=os.getenv('OPENAI_API_KEY'),
    extract_blocks_flag=True,
    word_count_threshold=5  # Minimum word count for an HTML tag to count as a worthwhile block
)
print(result.model_dump())
- Fetch multiple pages:
urls = [
    UrlModel(url='http://example.com', forced=False),
    UrlModel(url='http://example.org', forced=False)
]
results = crawler.fetch_pages(
    urls,
    provider="openai/gpt-3.5-turbo",
    api_token=os.getenv('OPENAI_API_KEY'),
    extract_blocks_flag=True,
    word_count_threshold=5
)
for res in results:
    print(res.json())
This code demonstrates how to use the Crawl4AI library to fetch a single page or multiple pages. The WebCrawler is initialized with the path to the database, and the fetch_page and fetch_pages methods are used to crawl the specified URLs.
Make sure the environment variable your code reads holds a valid API token for the chosen provider: OPENAI_API_KEY in the examples above, or GROQ_API_TOKEN if you use a Groq model.
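The library examples read the token with `os.getenv`, which returns None without complaint when the variable is unset, so the failure only shows up later at the provider call. Below is a small sketch of a stricter lookup. The helper name `require_env` is hypothetical and not part of Crawl4AI.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or fail loudly.

    Hypothetical helper: raises RuntimeError when the variable is
    missing or blank, instead of passing None on as the API token.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before running the crawler."
        )
    return value
```

In the examples above, `api_token=require_env('OPENAI_API_KEY')` could stand in for the `os.getenv` call.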
That's it! You can now integrate Crawl4AI into your Python projects and leverage its web crawling capabilities.
Parameter | Description | Required | Default Value |
---|---|---|---|
urls | A list of URLs to crawl and extract data from. | Yes | - |
provider_model | The provider and model to use for extracting relevant information (e.g., "groq/llama3-70b-8192"). | Yes | - |
api_token | Your API token for the specified provider. | Yes | - |
include_raw_html | Whether to include the raw HTML content in the response. | No | false |
forced | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
extract_blocks | Whether to extract meaningful blocks of text from the HTML. | No | false |
word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
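The table above can be turned into a small client-side payload builder that fills in the documented defaults and catches obvious mistakes before a request is sent. This is a sketch under stated assumptions: the function name `build_crawl_payload` is hypothetical, and rejecting a threshold below 5 is one possible reading of "minimum value is 5" (the server may behave differently, for example by clamping the value).

```python
MIN_WORD_COUNT_THRESHOLD = 5  # "minimum value is 5" per the parameter table


def build_crawl_payload(urls, provider_model, api_token, *,
                        include_raw_html=False, forced=False,
                        extract_blocks=False, word_count_threshold=5):
    """Build the JSON body for a /crawl request using the documented defaults.

    Hypothetical helper; the validation is a client-side assumption and
    does not describe what the server itself enforces.
    """
    if isinstance(urls, str):
        urls = [urls]  # accept a single URL for convenience
    urls = list(urls)
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if not provider_model:
        raise ValueError("provider_model is required")
    if not api_token:
        raise ValueError("api_token is required")
    if word_count_threshold < MIN_WORD_COUNT_THRESHOLD:
        raise ValueError(
            f"word_count_threshold must be at least {MIN_WORD_COUNT_THRESHOLD}"
        )
    return {
        "urls": urls,
        "provider_model": provider_model,
        "api_token": api_token,
        "include_raw_html": include_raw_html,
        "forced": forced,
        "extract_blocks": extract_blocks,
        "word_count_threshold": word_count_threshold,
    }
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, as in the request example earlier in this document.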
We welcome contributions from the open-source community to help improve Crawl4AI and make it even more valuable for AI enthusiasts and developers. To contribute, please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and commit them with descriptive messages.
- Push your changes to your forked repository.
- Submit a pull request to the main repository.
For more information on contributing, please see our contribution guidelines.
Crawl4AI is released under the MIT License.
If you have any questions, suggestions, or feedback, please feel free to reach out to us:
- GitHub: unclecode
- Twitter: @unclecode
- Discord: your-invite-link
Let's work together to make the web more accessible and useful for AI applications!