
Commit

Merge pull request mishushakov#32 from mishushakov/codegen
Codegen
mishushakov authored Jul 13, 2024
2 parents 4924529 + 61e6877 commit 64ef126
Showing 8 changed files with 204 additions and 38 deletions.
27 changes: 20 additions & 7 deletions README.md
@@ -2,7 +2,7 @@

<img width="1800" alt="Screenshot 2024-04-20 at 23 11 16" src="https://github.com/mishushakov/llm-scraper/assets/10400064/ab00e048-a9ff-43b6-81d5-2e58090e2e65">

LLM Scraper is a TypeScript library that allows you to convert **any** webpages into structured data using LLMs.
LLM Scraper is a TypeScript library that allows you to extract structured data from **any** webpage using LLMs.

> [!TIP]
> Under the hood, it uses function calling to convert pages to structured data. You can read more about this approach [here](https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction)
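
For illustration, the idea can be sketched with the `zod-to-json-schema` package this project already depends on: the Zod schema is converted to JSON Schema, and that JSON Schema is what the LLM is asked to fill in via a function/tool call. This is only a sketch of the approach, not the library's internal code:

```ts
import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'

// A Zod schema describing the data we want back from a page
const schema = z.object({
  title: z.string(),
  url: z.string(),
})

// Converting it to JSON Schema yields the function/tool definition
// that a function-calling LLM is asked to "call" with the extracted data
const jsonSchema = zodToJsonSchema(schema)
console.log(JSON.stringify(jsonSchema, null, 2))
```
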
@@ -14,7 +14,8 @@ LLM Scraper is a TypeScript library that allows you to convert **any** webpages
- Full type-safety with TypeScript
- Based on Playwright framework
- Streaming objects
- Supports 4 input modes:
- **NEW** Code-generation
- Supports 4 formatting modes:
- `html` for loading raw HTML
- `markdown` for loading markdown
- `text` for loading extracted text (using [Readability.js](https://github.com/mozilla/readability))
@@ -137,22 +138,34 @@ await page.close()
await browser.close()
```

### Streaming
## Streaming

Replace your `run` function with `stream` to get a partial object stream (Vercel AI SDK only).

```ts
// Run the scraper
const { stream } = await scraper.stream(page, schema, {
format: 'html',
})
// Run the scraper in streaming mode
const { stream } = await scraper.stream(page, schema)

// Stream the result from LLM
for await (const data of stream) {
console.log(data.top)
}
```
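
For context, `data.top` in the snippet above refers to the schema used by the streaming example (the top stories on Hacker News). A minimal sketch of such a schema is shown below; the exact field names are illustrative:

```ts
import { z } from 'zod'

// Illustrative schema: a `top` array of Hacker News stories
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        commentsURL: z.string(),
      })
    )
    .describe('Top 5 stories on Hacker News'),
})
```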

## NEW: Code-generation

Using the `generate` function, you can generate a reusable Playwright script that scrapes the page contents according to a schema.

```ts
// Generate code and run it on the page
const { code } = await scraper.generate(page, schema)
const result = await page.evaluate(code)
const data = schema.parse(result)

// Show the parsed result
console.log(data.news)
```
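
Because the generated script is plain Playwright code, it can also be saved and replayed later without another LLM call. A minimal sketch of that idea, continuing the snippet above (the file name and the revisited URL are illustrative, not part of the library's API):

```ts
import { writeFile, readFile } from 'node:fs/promises'

// Persist the generated script so future runs can skip the LLM entirely
await writeFile('scraper.js', code)

// Later (or in another process): load the saved script and run it on a fresh page
const savedCode = await readFile('scraper.js', 'utf8')
await page.goto('https://www.bbc.com')
const replayed = schema.parse(await page.evaluate(savedCode))
console.log(replayed.news)
```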

## Contributing

As an open-source project, we welcome contributions from the community. If you are experiencing any bugs or want to add some improvements, please feel free to open an issue or pull request.
41 changes: 41 additions & 0 deletions examples/codegen.ts
@@ -0,0 +1,41 @@
import { chromium } from 'playwright'
import { z } from 'zod'
import { anthropic } from '@ai-sdk/anthropic'
import LLMScraper from './../src'

// Launch a browser instance
const browser = await chromium.launch()

// Initialize LLM provider
const llm = anthropic('claude-3-5-sonnet-20240620')

// Create a new LLMScraper
const scraper = new LLMScraper(llm)

// Open new page
const page = await browser.newPage()
await page.goto('https://www.bbc.com')

// Define schema to extract contents into
const schema = z.object({
news: z.array(
z.object({
title: z.string(),
description: z.string(),
url: z.string(),
})
),
})

// Generate code and run it on the page
const { code } = await scraper.generate(page, schema)
console.log('code', code)

const result = await page.evaluate(code)
const data = schema.parse(result)

// Show the parsed result
console.log('result', data)

await page.close()
await browser.close()
2 changes: 1 addition & 1 deletion examples/streaming.ts
@@ -31,7 +31,7 @@ const schema = z.object({
.describe('Top 5 stories on Hacker News'),
})

// Run the scraper
// Run the scraper in streaming mode
const { stream } = await scraper.stream(page, schema, {
format: 'html',
})
62 changes: 57 additions & 5 deletions package-lock.json

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion package.json
@@ -1,7 +1,7 @@
{
"type": "module",
"name": "llm-scraper",
"version": "1.2.2",
"version": "1.5.0",
"description": "Turn any webpage intro structured data using LLMs",
"main": "dist/index.js",
"scripts": {
@@ -32,6 +32,7 @@
"zod-to-json-schema": "^3.22.5"
},
"devDependencies": {
"@ai-sdk/anthropic": "^0.0.30",
"@ai-sdk/openai": "^0.0.2",
"@types/node": "^20.12.7",
"@types/react": "^18.2.79",
1 change: 0 additions & 1 deletion src/cleanup.ts
@@ -35,7 +35,6 @@ export default function cleanup() {
const attributesToRemove = [
'style',
'src',
'href',
'alt',
'title',
'role',
40 changes: 34 additions & 6 deletions src/index.ts
@@ -7,6 +7,7 @@ import {
generateLlamaCompletions,
generateAISDKCompletions,
streamAISDKCompletions,
generateAISDKCode,
} from './models.js'

import cleanup from './cleanup.js'
@@ -107,7 +108,7 @@ export default class LLMScraper {
private async generateCompletions<T extends z.ZodSchema<any>>(
page: ScraperLoadResult,
schema: T,
options: ScraperRunOptions
options?: ScraperRunOptions
) {
switch (this.client.constructor) {
default:
@@ -126,7 +127,7 @@
private async streamCompletions<T extends z.ZodSchema<any>>(
page: ScraperLoadResult,
schema: T,
options: ScraperRunOptions
options?: ScraperRunOptions
) {
switch (this.client.constructor) {
default:
@@ -137,27 +138,54 @@
options
)
case LlamaModel:
throw new Error('Streaming not supported for local models yet')
throw new Error('Streaming not supported with GGUF models')
}
}

private async generateCode<T extends z.ZodSchema<any>>(
page: ScraperLoadResult,
schema: T,
options?: ScraperLLMOptions
) {
switch (this.client.constructor) {
default:
return generateAISDKCode<T>(
this.client as LanguageModelV1,
page,
schema,
options
)
case LlamaModel:
throw new Error('Code-generation not supported with GGUF models')
}
}

// Pre-process the page and generate completion
async run<T extends z.ZodSchema<any>>(
page: Page,
schema: T,
options: ScraperRunOptions
options?: ScraperRunOptions
) {
const preprocessed = await this.preprocess(page, options)
return this.generateCompletions<T>(preprocessed, schema, options)
}

// Pre-process the page and generate completion
// Pre-process the page and stream completion
async stream<T extends z.ZodSchema<any>>(
page: Page,
schema: T,
options: ScraperRunOptions
options?: ScraperRunOptions
) {
const preprocessed = await this.preprocess(page, options)
return this.streamCompletions<T>(preprocessed, schema, options)
}

// Pre-process the page and generate code
async generate(page: Page, schema: z.ZodSchema<any>, options?: ScraperLLMOptions) {
const preprocessed = await this.preprocess(page, {
...options,
format: 'cleanup',
})
return this.generateCode(preprocessed, schema, options)
}
}