Article’s publication date extractor – an overview

December 13, 2015

A few days ago I released an open source Python module that gives you a simple way to extract and normalize the publication date of any online blog or news post. There are some commercial solutions out there, but why not just use this module for free?

The logic behind the code

Here at Webz.io we use multiple methods to automatically detect and extract the date from articles, blog posts and comments. A publication date can appear in various ways and in multiple formats. It can be numerical (e.g. 01/02/2015), textual (e.g. Yesterday), or a combination of both (Jan 1st, 2015). On top of that, there are multiple types of separators, and a date such as 01/02/2015 can be read as January 2nd or February 1st, depending on whether it follows the American or European convention.

Fortunately there are standards out there. Unfortunately, there are A LOT of standards! The date extraction function tries multiple methods to accurately extract and normalize the date.
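The "try multiple methods" idea can be sketched as a chain of extractors where the first one that returns a value wins. This is only an illustration: `first_date` and the lambda extractors are hypothetical names, not the module's actual API.

```python
def first_date(extractors, document):
    """Run each extractor in order and return the first non-None result."""
    for extract in extractors:
        try:
            result = extract(document)
        except Exception:
            # Assumption: a failing extractor should not stop the chain.
            result = None
        if result is not None:
            return result
    return None

# Hypothetical extractors standing in for the JSON-LD, meta and HTML steps.
extractors = [
    lambda doc: None,              # pretend JSON-LD found nothing
    lambda doc: doc.get("meta"),   # pretend meta tags carried a date
    lambda doc: doc.get("html"),
]
print(first_date(extractors, {"meta": "2015-11-26", "html": "Nov 26"}))
```

Ordering the extractors from most to least reliable keeps the less precise methods as a last resort.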

Try the URL
More often than not the date appears in the URL of the post. Since the URL doesn’t include the time, we only use it as a fallback in case the other methods fail. We use a regular expression that matches multiple separator styles (1/1/2015, 1-1-2015, 1.1.2015, 1_1_2015).

Here is the regular expression we use:
([\./\-_]{0,1}(19|20)\d{2})[\./\-_]{0,1}(([0-3]{0,1}[0-9][\./\-_])|(\w{3,5}[\./\-_]))([0-3]{0,1}[0-9][\./\-]{0,1})?
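Here is a minimal sketch of how such a pattern could be applied. Note that the expression expects the year first, followed by the month and an optional day. The `date_from_url` helper is a hypothetical name, and for simplicity this sketch only handles numeric months and requires a day, which is an assumption rather than the module's behavior.

```python
import re
from datetime import datetime

# Year first, then a numeric or short textual month, then an optional day.
DATE_IN_URL = re.compile(
    r'([\./\-_]{0,1}(19|20)\d{2})[\./\-_]{0,1}'
    r'(([0-3]{0,1}[0-9][\./\-_])|(\w{3,5}[\./\-_]))'
    r'([0-3]{0,1}[0-9][\./\-]{0,1})?'
)

def _digits(text):
    """Drop separators and return the remaining digits as an int."""
    return int(re.sub(r'\D', '', text))

def date_from_url(url):
    """Return a datetime for a year/month/day found in the URL, else None."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year_part, month_part, day_part = match.group(1), match.group(4), match.group(6)
    # Assumption: skip textual months and URLs without a day.
    if month_part is None or day_part is None:
        return None
    try:
        return datetime(_digits(year_part), _digits(month_part), _digits(day_part))
    except ValueError:
        # Matched digits that are not a real calendar date.
        return None

print(date_from_url("https://example.com/2015/11/26/some-post/"))
```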

Try JSON-LD
JSON-LD is an easy-to-use JSON-based linked data format that defines the concept of context to specify the vocabulary for types and properties. Some documents specify the creation or publication date using this method, so it’s always worth a try!

JSON-LD markup example:

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "@type": "Create",
  "actor": {
    "@type": "Person",
    "@id": "acct:[email protected]",
    "displayName": "Sally"
  },
  "object": {
    "@type": "Note",
    "content": "This is a simple note"
  },
  "published": "2015-01-25T12:34:56Z"
}
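As a rough sketch, pulling the date out of an embedded JSON-LD block could look like the following. It uses only the standard library's `html.parser` as a stand-in for a full HTML parser. The key names checked (`datePublished`, `dateCreated`, `published`) and the `date_from_json_ld` helper are assumptions for illustration, and only a top-level JSON object is handled.

```python
import json
from html.parser import HTMLParser

DATE_KEYS = ("datePublished", "dateCreated", "published")  # assumed key names

class _LdJsonCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""
    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

def date_from_json_ld(html):
    """Return the first date string found in a JSON-LD block, else None."""
    collector = _LdJsonCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # malformed JSON, try the next block
        if not isinstance(data, dict):
            continue
        for key in DATE_KEYS:
            value = data.get(key)
            if isinstance(value, str):
                return value
    return None
```

The returned value is still a raw string at this point and needs to be normalized separately.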

META to the rescue?
If JSON-LD fails (it usually does), we look in the document’s meta tags for the date. There are many types of meta tags (a lot of standards, remember?), so we go over all of the different formats.

Some META tag examples:
<meta name="article.published" content="2015-11-26T11:53:00.000Z" />
<meta property="bt:pubDate" content="2015-11-26T00:10:33+00:00">
<meta name="DC.date.issued" content="2015-11-26">
<meta name="pubdate" content="2015-11-26T07:11:02Z">
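A simple way to picture the meta tag lookup is a scan over all `<meta>` elements, comparing their `name` or `property` against a list of known date labels. The sketch below covers only the four examples above. The `date_from_meta` helper is a hypothetical name, and the standard library parser is used here purely to keep the example dependency-free.

```python
from html.parser import HTMLParser

# Lower-cased labels taken from the examples above; a real list would be longer.
DATE_META_NAMES = {"article.published", "bt:pubdate", "dc.date.issued", "pubdate"}

class _MetaDateFinder(HTMLParser):
    """Remember the content of the first meta tag that looks like a date."""
    def __init__(self):
        super().__init__()
        self.date = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.date is not None:
            return
        attributes = dict(attrs)
        label = attributes.get("name") or attributes.get("property") or ""
        content = attributes.get("content")
        if label.lower() in DATE_META_NAMES and content:
            self.date = content.strip()

def date_from_meta(html):
    """Return the raw date string from a known meta tag, else None."""
    finder = _MetaDateFinder()
    finder.feed(html)
    return finder.date
```

Matching the labels case-insensitively is a design choice in this sketch, since sites are not consistent about capitalization.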

Last resort – the HTML
At the risk of losing accuracy, if all else fails we look into the HTML itself. A mix of standards and popular date annotations is evaluated in order to find the elusive date.
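To give a feel for this step, here is a small sketch that checks two common annotations: the HTML `<time datetime="...">` element and the schema.org `itemprop="datePublished"` attribute. Which annotations the module actually evaluates is not listed here, so both the choice of patterns and the `date_from_html` helper are assumptions.

```python
from html.parser import HTMLParser

class _HtmlDateFinder(HTMLParser):
    """Collect date strings from <time> tags and datePublished microdata."""
    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "time" and attributes.get("datetime"):
            self.candidates.append(attributes["datetime"])
        elif attributes.get("itemprop") == "datePublished":
            value = attributes.get("datetime") or attributes.get("content")
            if value:
                self.candidates.append(value)

def date_from_html(html):
    """Return the first annotated date string in document order, else None."""
    finder = _HtmlDateFinder()
    finder.feed(html)
    return finder.candidates[0] if finder.candidates else None
```

Taking the first candidate is a simplification; pages with comments or related posts can contain several dates, which is where the accuracy risk comes from.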

Unifying the date

Once we find the textual date, we unify it using the excellent python-dateutil module. It’s an amazing solution that converts a textual date into a datetime object.
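For illustration, here is a minimal sketch of normalizing a few of the formats mentioned earlier with `dateutil.parser.parse`. The `normalize` wrapper is a hypothetical helper, and the `dayfirst` flag shows one way to deal with the American versus European ambiguity. Relative words such as "Yesterday" are not something dateutil resolves on its own, so they would need separate handling.

```python
from dateutil import parser

def normalize(text, dayfirst=False):
    """Parse a textual date into a datetime, or return None if it can't be parsed."""
    try:
        return parser.parse(text, dayfirst=dayfirst)
    except (ValueError, OverflowError):
        return None

examples = ["Jan 1st, 2015", "2015-11-26T07:11:02Z", "01/02/2015"]
for text in examples:
    print(text, "->", normalize(text))

# The same string read with the European convention (day before month).
print("01/02/2015 (dayfirst) ->", normalize("01/02/2015", dayfirst=True))
```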

Parsing the document

In order to parse the HTML document, we use Beautiful Soup. It has powerful parsing capabilities, and it’s very simple to use. For the JSON-LD part, we use the built-in json module to load and parse the JSON.

Precision and Recall

We tested the “Article Date Extractor” module against Google’s news feed and got close to 100% precision with almost 90% recall. You can of course increase recall by adding more patterns to the HTML extraction function, but you risk lowering the precision score.
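As a reminder of what the two numbers mean here: precision is the share of extracted dates that are correct, and recall is the share of dated articles for which a correct date was extracted. A small sketch follows; the counts in the example call are made up for illustration and are not the figures from our test.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw counts.

    true_positives:  articles where the extracted date was correct
    false_positives: articles where a wrong date was extracted
    false_negatives: dated articles where nothing was extracted
    """
    extracted = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall

# Made-up counts, only to show the calculation.
print(precision_recall(45, 5, 5))
```

Adding a looser HTML pattern typically turns some false negatives into hits, but it also tends to add false positives, which is the trade-off described above.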

Contribute

That’s it. Feel free to share it, use it, and contribute if you think you can make this module better.

Ran Geva

CEO
