jonaskahn/article-extractor

Refined version of @extractus/article-extractor

The output format differs from the upstream library. A response now looks like this:

{
    "url": "https://www.androidauthority.com/samsung-galaxy-s24-ultra-camera-samples-3404790/",
    "source": "androidauthority.com",
    "meta": {
        "title": "Here are some Samsung Galaxy S24 Ultra camera samples shot in San Jose",
        "description": "Today, Samsung took the wraps off its latest line of premium flagship smartphones: the Samsung Galaxy S24 series. In this lineup, the most powerful phone is the mighty Ultra variant. With five lenses on the...",
        "links": [
            "https://www.androidauthority.com/samsung-galaxy-s24-ultra-camera-samples-3404790/"
        ],
        "cover": "https://www.androidauthority.com/wp-content/uploads/2024/01/Samsung-Galaxy-S24-Ultra-1-1.jpg",
        "favicon": "",
        "author": "@c_scottbrown",
        "published": "2024-01-18T01:08:11+00:00"
    },
    "images": [],
    "content": "Today, Samsung took the wraps off its latest line of premium flagship smartphones: the Samsung Galaxy S24 series. In this lineup, the most powerful phone is the mighty Ultra variant. With five lenses on the back — including a newly upgraded 50MP periscope telephoto lens — this is clearly designed for folks who care a lot about camera phones. But how good is it? Well, here are some Samsung Galaxy S24 Ultra camera samples for you to see for yourself.To be clear, this shouldn’t be seen as an in-depth camera analysis for this phone. Instead, view these as an early sample of an average person wandering around San Jose and using the camera in various situations. In other words, I didn’t try to push the limits — I just did some point-and-shoot experiments.If you want to view the full-resolution, unedited samples I captured, check out our public Drive. The samples shown on this page have been heavily compressed, so don’t bother pixel-peeping them.Samsung Galaxy S24 Ultra camera samples: Selfies\nThe selfie camera hardware on the Galaxy S24 Ultra is exactly the same as we saw on the Samsung Galaxy S23 Ultra, so these results shouldn’t be surprising. As one would expect, the outdoors photo looks best, while the one shot indoors in a low-light situation looks mushy and undefined.One thing I immediately noticed, though, is that you need to be careful with the phone’s display. Samsung bumped the peak brightness to an astonishing 2,600 nits, so it now is like a miniature sun when you’re taking a selfie. Be sure to use adaptive brightness or manually turn it down when you’re in a low-light situation, or you’ll see some abysmal results.Portrait mode\nSamsung’s portrait mode has always been a market leader, and it doesn’t look like much has changed here. 
The artificial bokeh around the statues is uniform and even gets into places you’d think it would miss, such as the gaps between the arms and the bodies.Ultrawide\nOnce again, since Samsung didn’t change anything about the ultrawide lens on the Galaxy S24 Ultra, we see predictably terrific results here. With ultrawide lenses, photos of real-world straight lines can look wonky because of the curve of the lens, but Samsung’s post-processing has no issue with these roof panels.Zoom\nI did two sessions here because this is where Samsung made significant changes. As a quick recap, the Galaxy S22 Ultra and S23 Ultra had 10x optical periscope zoom on a 10MP sensor. The Galaxy S24 Ultra, though, has 5x optical periscope zoom on a 50MP sensor — so a much better sensor, but less room for the preservation of visual fidelity.In the example above, you can see that cropping, typical smartphone image enhancement, and Samsung’s new Galaxy AI tricks can do a pretty decent job beyond the 5x mark. While I wouldn’t say the 30x shot is spectacular, the 10x shot looks terrific when you understand that that’s not an optical capture.Let’s look at another example where things take a bit of a turn:Once again, the 10x zoom shot here is pretty great for a non-optical shot. Likewise, the 30x shot looks crystal clear, with all the text being completely legible from very far away.The 100x shot is a bit wonky, though. Samsung’s sharpening/processing went too far and made the text legible by making it look artificial. Of course, going beyond 30x on a smartphone is not advisable in the first place, but if you were hoping for AI-powered magic for 100x zoom, there’s still a lot of work left to do.Color, macro, and more\nThese final shots show off some of the typical things you might face while taking photos. The shot with harsh sunlight, in particular, is something most folks will need to deal with at some point.The flower shot is a bit intense. 
Samsung loves to make photos really pop with color, despite how unrealistic it might look. It doesn’t appear that’s changed with the Galaxy S24 Ultra, as this flower was bright, sure, but not that bright.I do love the macro photo of the tree bark. The texture looks fantastic, and there’s so much detail. The fountain photo is a bit underwhelming, but to Samsung’s credit, it was moving very fast.Don’t forget to check out our public Drive for high-resolution versions of all these photos.\n\nStay tuned for much more in-depth examinations of the Galaxy S24 series, including a full review. In the meantime, what do you think of these photos? Do they make you more/less likely to buy a Galaxy S24 Ultra or leave you unfazed? Let us know in the comments.",
    "rawContent": "<div><p>Today, Samsung took the wraps off its latest line of premium flagship smartphones: <a target=\"_blank\" href=\"https://www.androidauthority.com/samsung-galaxy-s24-3321740/\">the Samsung Galaxy S24 series</a>. In this lineup, the most powerful phone is the mighty Ultra variant. With five lenses on the back — including a newly upgraded 50MP periscope telephoto lens — this is clearly designed for folks who care a lot about <a target=\"_blank\" href=\"https://www.androidauthority.com/best-camera-phones-670620/\">camera phones</a>. But how good is it? Well, here are some Samsung Galaxy S24 Ultra camera samples for you to see for yourself.</p><p>To be clear, this shouldn’t be seen as an in-depth camera analysis for this phone. Instead, view these as an early sample of an average person wandering around San Jose and using the camera in various situations. In other words, I didn’t try to push the limits — I just did some point-and-shoot experiments.</p><p>If you want to view the full-resolution, unedited samples I captured, <a href=\"https://drive.google.com/drive/folders/1OPVxBcjyxj1zUYqL-1GP2JS7BjnZbQlv?usp=sharing\" target=\"_blank\">check out our public Drive</a>. The samples shown on this page have been heavily compressed, so don’t bother pixel-peeping them.</p><p></p><h2>Samsung Galaxy S24 Ultra camera samples: Selfies</h2>\n<p></p><p>The selfie camera hardware on the Galaxy S24 Ultra is exactly the same as we saw on the <a target=\"_blank\" href=\"https://www.androidauthority.com/samsung-galaxy-s23-ultra-review-3280337/\">Samsung Galaxy S23 Ultra</a>, so these results shouldn’t be surprising. As one would expect, the outdoors photo looks best, while the one shot indoors in a low-light situation looks mushy and undefined.</p><p>One thing I immediately noticed, though, is that you need to be careful with the phone’s display. 
Samsung bumped the peak brightness to an astonishing 2,600 nits, so it now is like a miniature sun when you’re taking a selfie. Be sure to use adaptive brightness or manually turn it down when you’re in a low-light situation, or you’ll see some abysmal results.</p><p></p><h2>Portrait mode</h2>\n<p></p><p>Samsung’s portrait mode has always been a market leader, and it doesn’t look like much has changed here. The artificial bokeh around the statues is uniform and even gets into places you’d think it would miss, such as the gaps between the arms and the bodies.</p><p></p><h2>Ultrawide</h2>\n<p></p><p>Once again, since Samsung didn’t change anything about the ultrawide lens on the Galaxy S24 Ultra, we see predictably terrific results here. With ultrawide lenses, photos of real-world straight lines can look wonky because of the curve of the lens, but Samsung’s post-processing has no issue with these roof panels.</p><p></p><h2>Zoom</h2>\n<p></p><p>I did two sessions here because this is where Samsung made significant changes. As a quick recap, the <a target=\"_blank\" href=\"https://www.androidauthority.com/galaxy-s22-ultra-vs-s23-ultra-3266173/\">Galaxy S22 Ultra</a> and S23 Ultra had 10x optical periscope zoom on a 10MP sensor. The Galaxy S24 Ultra, though, has 5x optical periscope zoom on a 50MP sensor — so a much better sensor, but less room for the preservation of visual fidelity.</p><p>In the example above, you can see that cropping, typical smartphone image enhancement, and Samsung’s new Galaxy AI tricks can do a pretty decent job beyond the 5x mark. While I wouldn’t say the 30x shot is spectacular, the 10x shot looks terrific when you understand that that’s <strong>not</strong> an optical capture.</p><p>Let’s look at another example where things take a bit of a turn:</p><p>Once again, the 10x zoom shot here is pretty great for a non-optical shot. 
Likewise, the 30x shot looks crystal clear, with all the text being completely legible from very far away.</p><p>The 100x shot is a bit wonky, though. Samsung’s sharpening/processing went too far and made the text legible by making it look artificial. Of course, going beyond 30x on a smartphone is not advisable in the first place, but if you were hoping for AI-powered magic for 100x zoom, there’s still a lot of work left to do.</p><p></p><h2>Color, macro, and more</h2>\n<p></p><p>These final shots show off some of the typical things you might face while taking photos. The shot with harsh sunlight, in particular, is something most folks will need to deal with at some point.</p><p>The flower shot is a bit intense. Samsung loves to make photos really pop with color, despite how unrealistic it might look. It doesn’t appear that’s changed with the Galaxy S24 Ultra, as this flower was bright, sure, but not <strong>that</strong> bright.</p><p>I do love the macro photo of the tree bark. The texture looks fantastic, and there’s so much detail. The fountain photo is a bit underwhelming, but to Samsung’s credit, it was moving very fast.</p><div><p>Don’t forget to check out <a href=\"https://drive.google.com/drive/folders/1OPVxBcjyxj1zUYqL-1GP2JS7BjnZbQlv?usp=sharing\" target=\"_blank\">our public Drive</a> for high-resolution versions of all these photos.</p>\n<hr />\n<p>Stay tuned for much more in-depth examinations of the Galaxy S24 series, including a full review. In the meantime, what do you think of these photos? Do they make you more/less likely to buy a Galaxy S24 Ultra or leave you unfazed? Let us know in the comments.</p>\n</div></div>",
    "ttr": 154,
    "type": "article"
}

@extractus/article-extractor

Extract main article, main image and meta data from URL.

(This library was formerly named article-parser.)

Demo

Install & Usage

Node.js

npm i @extractus/article-extractor

# pnpm
pnpm i @extractus/article-extractor

# yarn
yarn add @extractus/article-extractor

// es6 module
import { extract } from '@extractus/article-extractor'

Deno

import { extract } from 'https://esm.sh/@extractus/article-extractor'

// deno > 1.28
import { extract } from 'npm:@extractus/article-extractor'

Browser

import { extract } from 'https://esm.sh/@extractus/article-extractor'

Please check the examples for reference.

APIs


extract()

Loads and extracts article data. Returns a Promise.

Syntax

extract(String input)
extract(String input, Object parserOptions)
extract(String input, Object parserOptions, Object fetchOptions)

Example:

import { extract } from '@extractus/article-extractor'

const input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

// here we use top-level await, assume current platform supports it
try {
  const article = await extract(input)
  console.log(article)
} catch (err) {
  console.error(err)
}

The result - article - can be null or an object with the following structure:

{
  url: String,
  title: String,
  description: String,
  image: String,
  author: String,
  favicon: String,
  content: String,
  published: Date String,
  type: String, // page type
  source: String, // original publisher
  links: Array, // list of alternative links
  ttr: Number, // time to read in seconds, 0 = unknown
}
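
The sample response at the top of this README nests metadata under meta, while the structure above is flat. The following is a minimal sketch of mapping one shape to the other. The helper name toFlatArticle is hypothetical, and the field mapping (in particular meta.cover to image) is an assumption inferred from the sample response rather than documented behavior.

```javascript
// Sketch: map the nested response (with a `meta` object) to the flat structure.
// `toFlatArticle` is a hypothetical helper, not part of the library.
// The mapping `meta.cover` -> `image` is an assumption based on the sample response.
function toFlatArticle(result) {
  // extract() may resolve to null when nothing could be extracted
  if (!result) {
    return null
  }
  const meta = result.meta || {}
  return {
    url: result.url,
    title: meta.title || '',
    description: meta.description || '',
    image: meta.cover || '',
    author: meta.author || '',
    favicon: meta.favicon || '',
    content: result.content || '',
    published: meta.published || '',
    type: result.type || '',
    source: result.source || '',
    links: meta.links || [],
    ttr: result.ttr || 0,
  }
}
```

Checking for null before reading any field keeps calling code from throwing when an extraction yields nothing.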

Parameters

input required

A URL string pointing to the article, or the HTML content of that web page.

parserOptions optional

Object with all or several of the following properties:

  • wordsPerMinute: Number, to estimate time to read. Default 300.
  • descriptionTruncateLen: Number, max num of chars generated for description. Default 210.
  • descriptionLengthThreshold: Number, min num of chars required for description. Default 180.
  • contentLengthThreshold: Number, min num of chars required for content. Default 200.

For example:

import { extract } from '@extractus/article-extractor'

const article = await extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', {
  descriptionLengthThreshold: 120,
  contentLengthThreshold: 500
})

console.log(article)
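
To illustrate how wordsPerMinute relates to the ttr field, here is a rough sketch of a reading-time estimate in seconds. The function name estimateTimeToRead is hypothetical, and the library's actual computation may differ (for example in how it counts words).

```javascript
// Sketch: estimate reading time in seconds from plain text.
// `estimateTimeToRead` is a hypothetical helper; the real library may count differently.
function estimateTimeToRead(text, wordsPerMinute = 300) {
  // split on whitespace and drop empty entries so blank input yields 0 words
  const words = text.trim().split(/\s+/).filter(Boolean).length
  // minutes = words / wordsPerMinute, converted to whole seconds
  return Math.ceil((words / wordsPerMinute) * 60)
}

console.log(estimateTimeToRead('word '.repeat(600))) // 120
console.log(estimateTimeToRead('word '.repeat(600), 200)) // 180
```

A lower wordsPerMinute value gives a longer estimate for the same text.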

fetchOptions optional

fetchOptions is an object that can have the following properties:

  • headers: to set request headers
  • proxy: another endpoint to forward the request to
  • agent: an HTTP proxy agent
  • signal: AbortController signal or AbortSignal timeout to terminate the request

For example, you can use this parameter to set the request headers passed to fetch:

import { extract } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
const article = await extract(url, {}, {
  headers: {
    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
  }
})

console.log(article)

You can also specify a proxy endpoint to load remote content, instead of fetching directly.

For example:

import { extract } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

await extract(url, {}, {
  headers: {
    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
  },
  proxy: {
    target: 'https://your-secret-proxy.io/loadXml?url=',
    headers: {
      'Proxy-Authorization': 'Bearer YWxhZGRpbjpvcGVuc2VzYW1l...'
    },
  }
})
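
The target value above ends with url=, which suggests the article URL is appended to it as a query value. The sketch below shows that idea. Both the helper name buildProxyUrl and the use of encodeURIComponent are assumptions; check how your proxy endpoint expects the URL to be passed.

```javascript
// Sketch: combine a proxy target prefix with the article URL.
// `buildProxyUrl` is hypothetical; percent-encoding the URL is an assumption
// about what the proxy endpoint expects.
function buildProxyUrl(target, articleUrl) {
  return target + encodeURIComponent(articleUrl)
}

console.log(buildProxyUrl('https://your-secret-proxy.io/loadXml?url=', 'https://a.com/x?y=1'))
// https://your-secret-proxy.io/loadXml?url=https%3A%2F%2Fa.com%2Fx%3Fy%3D1
```

Encoding keeps the article's own query string from being mixed into the proxy's query string.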

Passing requests through a proxy is useful when running @extractus/article-extractor in the browser. See examples/browser-article-parser for a reference example.

For more info about proxy authentication, please refer to HTTP authentication.

For a deeper customization, you can consider using Proxy to replace fetch behaviors with your own handlers.

Another way to work with a proxy is to use the agent option instead of proxy, as below:

import { extract } from '@extractus/article-extractor'

import { HttpsProxyAgent } from 'https-proxy-agent'

const proxy = 'http://abc:RaNdoMpasswORd_country-France@proxy.packetstream.io:31113'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

const article = await extract(url, {}, {
  agent: new HttpsProxyAgent(proxy),
})
console.log('Run article-extractor with proxy:', proxy)
console.log(article)

For more info about https-proxy-agent, check its repo.

By default, there is no request timeout. You can use the signal option to cancel a request after a chosen delay.

A common way is to use AbortController:

const controller = new AbortController()

// stop after 5 seconds
setTimeout(() => {
  controller.abort()
}, 5000)

const data = await extract(url, null, {
  signal: controller.signal,
})

A newer solution is AbortSignal's timeout() static method:

// stop after 5 seconds
const data = await extract(url, null, {
  signal: AbortSignal.timeout(5000),
})
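
With the AbortController approach, the timer keeps running after the request has finished unless it is cleared. Here is a small sketch of a helper that returns both the signal and a cleanup function. The name timeoutSignal is hypothetical, and the extract() call is shown only in comments.

```javascript
// Sketch: build an abort signal that fires after `ms` milliseconds,
// plus a `clear` function to cancel the pending timer.
// `timeoutSignal` is a hypothetical helper, not part of the library.
function timeoutSignal(ms) {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), ms)
  return {
    signal: controller.signal,
    // call once the request settles so the timer does not keep the process alive
    clear: () => clearTimeout(timer),
  }
}

// Intended usage (not executed here):
// const { signal, clear } = timeoutSignal(5000)
// try {
//   const data = await extract(url, {}, { signal })
// } finally {
//   clear()
// }
```

AbortSignal.timeout() needs no such cleanup, so it is the simpler choice where the platform supports it.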


extractFromHtml()

Extracts article data from an HTML string. Returns a Promise, the same as the extract() method above.

Syntax

extractFromHtml(String html)
extractFromHtml(String html, String url)
extractFromHtml(String html, String url, Object parserOptions)

Example:

import { extractFromHtml } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

const res = await fetch(url)
const html = await res.text()

// you can process the raw html here: clean up, remove ad banners, etc.
// just make sure the result is still an html string

const article = await extractFromHtml(html, url)
console.log(article)

Parameters

html required

HTML string which contains the article you want to extract.

url optional

URL string that indicates the source of that HTML content. article-extractor may use this info to handle internal/relative links.
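
To show why the source URL matters for relative links, here is a sketch that resolves a link against a base URL with the standard URL constructor. The helper name resolveLink is hypothetical; how article-extractor resolves links internally is not shown here.

```javascript
// Sketch: resolve a possibly relative link against the page URL.
// `resolveLink` is a hypothetical helper; it uses the standard URL constructor.
function resolveLink(href, baseUrl) {
  try {
    return new URL(href, baseUrl).href
  } catch (err) {
    // without a valid base, a relative link cannot be resolved; return it unchanged
    return href
  }
}

const base = 'https://example.com/news/1.html'
console.log(resolveLink('/pics/a.jpg', base)) // https://example.com/pics/a.jpg
console.log(resolveLink('https://other.org/x', base)) // https://other.org/x
```

When no url is passed, a link such as /pics/a.jpg has nothing to resolve against, which is why supplying it is recommended for pages with relative links.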

parserOptions optional

See parserOptions above.


Transformations

Sometimes the default extraction algorithm does not work well for a particular site. That is where transformations help.

By running custom functions before and after the main extraction step, we can improve the result for those sites.

Two methods manage transformations:

  • addTransformations(Object transformation | Array transformations)
  • removeTransformations(Array patterns)

First, let's look at the transformation object.

transformation object

In @extractus/article-extractor, transformation is an object with the following properties:

  • patterns: required, a list of regexps to match the URLs
  • pre: optional, a function to process raw HTML
  • post: optional, a function to process extracted article

In short, a transformation works like this:

for URLs that match these patterns,
run the pre function to normalize the HTML content,
then extract the main article content from the normalized HTML and, if that succeeds,
run the post function to normalize the extracted article content

(Figure: article-extractor extraction process)

Here is an example transformation:

{
  patterns: [
    /([\w]+.)?domain.tld\/*/,
    /domain.tld\/articles\/*/
  ],
  pre: (document) => {
    // remove all .advertise-area and its siblings from raw HTML content
    document.querySelectorAll('.advertise-area').forEach((element) => {
      if (element.nodeName === 'DIV') {
        while (element.nextSibling) {
          element.parentNode.removeChild(element.nextSibling)
        }
        element.parentNode.removeChild(element)
      }
    })
    return document
  },
  post: (document) => {
    // with extracted article, replace all h4 tags with h2
    document.querySelectorAll('h4').forEach((element) => {
      const h2Element = document.createElement('h2')
      h2Element.innerHTML = element.innerHTML
      element.parentNode.replaceChild(h2Element, element)
    })
    // change small sized images to original version
    document.querySelectorAll('img').forEach((element) => {
      const src = element.getAttribute('src')
      if (src && src.includes('domain.tld/pics/150x120/')) {
        const fullSrc = src.replace('/pics/150x120/', '/pics/original/')
        element.setAttribute('src', fullSrc)
      }
    })
    return document
  }
}
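
One detail worth knowing when writing patterns: an unescaped dot in a regular expression matches any character, so the patterns above are looser than they look. The comparison below is only an illustration; the strict variant and the test URLs are made up for this sketch.

```javascript
// The pattern from the example above, with unescaped dots
const loose = /([\w]+.)?domain.tld\/*/
// A stricter variant (an illustration, not taken from the library) with escaped dots
const strict = /([\w]+\.)?domain\.tld\/*/

const urls = [
  'https://www.domain.tld/articles/1',
  'https://domain.tld/',
  'https://domainxtld.example/page',
]

for (const url of urls) {
  console.log(url, loose.test(url), strict.test(url))
}
// https://www.domain.tld/articles/1 true true
// https://domain.tld/ true true
// https://domainxtld.example/page true false
```

Whether the looser match matters depends on the sites you process; escaping the dots simply narrows the match to the literal domain.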

addTransformations(Object transformation | Array transformations)

Add a single transformation or a list of transformations. For example:

import { addTransformations } from '@extractus/article-extractor'

addTransformations({
  patterns: [
    /([\w]+.)?abc.tld\/*/
  ],
  pre: (document) => {
    // do something with document
    return document
  },
  post: (document) => {
    // do something with document
    return document
  }
})

addTransformations([
  {
    patterns: [
      /([\w]+.)?def.tld\/*/
    ],
    pre: (document) => {
      // do something with document
      return document
    },
    post: (document) => {
      // do something with document
      return document
    }
  },
  {
    patterns: [
      /([\w]+.)?xyz.tld\/*/
    ],
    pre: (document) => {
      // do something with document
      return document
    },
    post: (document) => {
      // do something with document
      return document
    }
  }
])

Transformations without patterns are ignored.

removeTransformations(Array patterns)

Removes the transformations that match the given patterns.

For example, we can remove all added transformations above:

import { removeTransformations } from '@extractus/article-extractor'

removeTransformations([
  /([\w]+.)?abc.tld\/*/,
  /([\w]+.)?def.tld\/*/,
  /([\w]+.)?xyz.tld\/*/
])

Calling removeTransformations() without arguments removes all current transformations.

Priority order

While processing an article, more than one transformation can be applied.

Suppose that we have the following transformations:

[
  {
    patterns: [
      /http(s?):\/\/google.com\/*/,
      /http(s?):\/\/goo.gl\/*/
    ],
    pre: function_one,
    post: function_two
  },
  {
    patterns: [
      /http(s?):\/\/goo.gl\/*/,
      /http(s?):\/\/google.inc\/*/
    ],
    pre: function_three,
    post: function_four
  }
]

As you can see, an article from goo.gl matches both of them.

In this scenario, @extractus/article-extractor will execute both transformations, one by one:

function_one -> function_three -> extraction -> function_two -> function_four
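
The ordering above can be reproduced with a small stand-in for the pipeline. This is a sketch only: applyTransformations and step are hypothetical names, the stand-in works on plain strings rather than DOM documents, and it is not the library's actual implementation.

```javascript
// Sketch of the described flow: all matching `pre` functions, then extraction,
// then all matching `post` functions, each in registration order.
// `applyTransformations` is a hypothetical stand-in, not the library's code.
function applyTransformations(url, html, transformations, extractMain) {
  const matched = transformations.filter((t) =>
    Array.isArray(t.patterns) && t.patterns.some((re) => re.test(url))
  )
  let input = html
  for (const t of matched) {
    if (typeof t.pre === 'function') input = t.pre(input)
  }
  const extracted = extractMain(input)
  // post functions only run when extraction succeeded
  if (!extracted) return null
  let output = extracted
  for (const t of matched) {
    if (typeof t.post === 'function') output = t.post(output)
  }
  return output
}

// record the order in which each step runs
const calls = []
const step = (name) => (value) => {
  calls.push(name)
  return value
}

const transformations = [
  {
    patterns: [/http(s?):\/\/google.com\/*/, /http(s?):\/\/goo.gl\/*/],
    pre: step('function_one'),
    post: step('function_two'),
  },
  {
    patterns: [/http(s?):\/\/goo.gl\/*/, /http(s?):\/\/google.inc\/*/],
    pre: step('function_three'),
    post: step('function_four'),
  },
]

applyTransformations('https://goo.gl/abc', '<p>hi</p>', transformations, step('extraction'))
console.log(calls.join(' -> '))
// function_one -> function_three -> extraction -> function_two -> function_four
```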


sanitize-html's options

@extractus/article-extractor uses sanitize-html to make a clean sweep of HTML content.

Here are the default options.

Depending on the needs of your content system, you might want to gather some HTML tags/attributes, while ignoring others.

There are 2 methods to access and modify these options in @extractus/article-extractor.

  • getSanitizeHtmlOptions()
  • setSanitizeHtmlOptions(Object sanitizeHtmlOptions)

Read sanitize-html docs for more info.
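
As a sketch of how these two methods might be combined, the helper below builds a new options object that allows extra tags and attributes. The name extendSanitizeOptions is hypothetical, and the current object is a made-up stand-in for whatever getSanitizeHtmlOptions() returns. The allowedTags and allowedAttributes keys are sanitize-html options; the sketch assumes allowedTags is an array.

```javascript
// Sketch: return a copy of sanitize-html options with extra tags/attributes allowed.
// `extendSanitizeOptions` is a hypothetical helper; it assumes `allowedTags` is an array.
function extendSanitizeOptions(current, extraTags, extraAttributes) {
  const allowedTags = Array.from(new Set([...(current.allowedTags || []), ...extraTags]))
  const allowedAttributes = { ...(current.allowedAttributes || {}) }
  for (const [tag, attrs] of Object.entries(extraAttributes)) {
    // merge with existing attributes for the tag, without duplicates
    allowedAttributes[tag] = Array.from(new Set([...(allowedAttributes[tag] || []), ...attrs]))
  }
  return { ...current, allowedTags, allowedAttributes }
}

// stand-in for the object returned by getSanitizeHtmlOptions(); values are hypothetical
const current = { allowedTags: ['p', 'a', 'img'], allowedAttributes: { a: ['href'] } }
const next = extendSanitizeOptions(current, ['iframe'], { iframe: ['src'], a: ['target'] })
console.log(next.allowedTags) // [ 'p', 'a', 'img', 'iframe' ]
console.log(next.allowedAttributes) // { a: [ 'href', 'target' ], iframe: [ 'src' ] }
```

With the real library, current would come from getSanitizeHtmlOptions() and next would be passed to setSanitizeHtmlOptions(next).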


Test

git clone https://github.com/extractus/article-extractor.git
cd article-extractor
pnpm i
pnpm test

Quick evaluation

git clone https://github.com/extractus/article-extractor.git
cd article-extractor
pnpm i
pnpm eval {URL_TO_PARSE_ARTICLE}

License

The MIT License (MIT)

Support the project

If you find value from this open source project, you can support in the following ways:

Thank you.

