Calling all (almost) Kimono Labs Developers to Migrate to Webz.io

Kimono Labs announced today that it has been acquired by Palantir. Unfortunately, Kimono Labs users will only have two weeks to migrate to a different service because the team will…

Extracting Data from Forums: 3 Sources to Discover What Your Market Really Thinks

Robert Collier, the great ad man of the early 20th century, once summarized the secret of all effective marketing as entering “the conversation already taking place in the customer’s mind.” That’s powerful…

Article’s publication date extractor – an overview

A few days ago I released an open source Python module that provides you with a simple way to extract and normalize the publication date of any online blog or news post…

To crawl or not to crawl, that is the question

In order to write an efficient crawler, you must be smart about the content you download. When your crawler downloads an HTML page it uses bandwidth, memory and CPU, not only its…

Dead simple {for devs} Python crawler (script) for extracting structured data from any website into CSV

In my previous post I wrote about a very basic web crawler that can randomly scour the web and mirror/download websites. Today I want to share with you a very simple…

Tiny basic multi-threaded web crawler in Python

If you need a simple web crawler that will scour the web for a while and download content from random sites, this code is for you. Usage:

Where https://cnn.com is your seed site. It could…

How we quadrupled the performance of Elasticsearch

Well, that’s a misleading title. We actually quadrupled the performance of our brand monitoring alert system that uses Elasticsearch’s Percolator, but that would have been a much longer title. Some background Buzzilla…

Building a Better Search Query

Many factors can affect streaming data relevancy. When the data you consume isn’t ordered by relevancy, but rather by the time it was crawled, getting the relevant posts is essential. I would like…

Webz.io Tips & Tricks: Content Marketing & SEO

I would like to share with you 2 simple tips about how to leverage Webz.io to promote your website, product or service organically.

Webz.io Tips & Tricks: Search for Reviews

Are you looking to focus your data search specifically on consumer-generated reviews? Here are a couple of simple Webz.io tricks that might help: Limit your query to specific sites You can…

Vertical Aggregation and Pattern Matching Crawlers

After bashing various crawling techniques, I would like to describe the technique we use here at Webz.io, a technology that was developed over the past 8 years. Our crawlers were developed with…

Crawling Horrors – Computer Vision Crawlers

So if RSS crawlers are bad and browser scraping isn’t efficient, what about computer vision web-page analyzers? This technology uses machine learning and computer vision to extract information from web pages by interpreting…