A more accurate (if much longer) title would have been: how we quadrupled the performance of our brand monitoring alert system built on Elasticsearch’s Percolator.
Buzzilla has two main products. The first is Webz.io, which provides businesses worldwide with access to structured data from the open web; the second is the leading brand monitoring system in Israel.
Although Israel is a small country, Israelis usually create complex queries that put a lot of stress on our servers (this could also be attributed to the complexity of the Hebrew language, but that’s for another post). One of the most popular features of the system is its ability to send push notifications (usually by email) when a post matches a Boolean query.
As I mentioned, we use the Elasticsearch Percolator to register our queries (about 3,500 of them) and run each post we crawl against them. We run about 1 million posts a day against those queries, and matching posts are sent to our clients. The system is distributed and uses RabbitMQ to pull posts from our crawlers’ queue.
We made some optimizations in the past, skipping the Boolean query for documents we knew beforehand couldn’t match. We did that by comparing properties of the query and the document; for example, if the languages didn’t match, there was no need to check the rest of the query.
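A minimal sketch of that cheap pre-check, assuming a hypothetical metadata layout (the `can_match` helper and the dict fields are illustrative, not our production code):

```python
def can_match(query_meta, doc_meta):
    """Cheap pre-check: if the query is restricted to a language the
    document is not in, the full Boolean query can't possibly match."""
    q_lang = query_meta.get("language")
    return q_lang is None or q_lang == doc_meta.get("language")


queries = [
    {"id": 1, "language": "he", "q": "..."},  # placeholder query bodies
    {"id": 2, "language": "en", "q": "..."},
]
doc = {"language": "en", "text": "..."}

# Only queries that survive the cheap check are actually percolated.
candidates = [q for q in queries if can_match(q, doc)]
```

The same idea extends to any property that is cheap to compare and necessary for a match (source, country, and so on).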
With our old configuration, we were able to run about 30 documents per minute against all of our queries per server (a powerful server, at that). As the volume of crawled data and the number of queries grew, we struggled to keep up, at times with delays of a few hours between crawl time and alert match. We found ourselves adding more and more hardware to try to solve the problem.
What did the trick was creating a pre-percolation process that concatenates multiple posts and runs the queries against the concatenated string (you must, of course, remove the Boolean NOT clauses from the queries; I’ll explain why later on). If there is no match, great: you just saved the time of checking each individual post. If there is a match, bummer: you wasted time checking the concatenated string. Fortunately, the former is much more frequent than the latter.
So now I will explain why it worked. Let’s take two phrases, or posts, as an example:

Post A: “The quick brown fox jumps over the lazy dog”
Post B: “This is a quick example since I’m lazy”

The combined text would be: “The quick brown fox jumps over the lazy dog This is a quick example since I’m lazy”
It’s obvious that a query that doesn’t match the combined text can’t match either of its parts. So by running the query once against one long chunk of text, we avoid running it against two shorter chunks. If, on the other hand, it does match, we then need to run it against each post to see which post matched. But even then we know which query matched, so we don’t have to run all the queries again on each post.
So why is running a query against one large chunk of text faster than running it against two short chunks? Because we run the query against an index, and the index of the concatenated texts is smaller than the size of each post’s index combined:
SizeOfIndex(Post A + Post B) < SizeOfIndex(Post A) + SizeOfIndex(Post B)
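You can see the inequality with a back-of-the-envelope calculation, using the number of distinct terms as a crude stand-in for index size (a simplification: a real Lucene index stores more than a term set, but shared terms are likewise stored once):

```python
post_a = "The quick brown fox jumps over the lazy dog"
post_b = "This is a quick example since I'm lazy"


def index_size(text):
    # Number of distinct lowercase terms: a rough proxy for index size.
    return len(set(text.lower().split()))


combined = index_size(post_a + " " + post_b)   # 14 distinct terms
separate = index_size(post_a) + index_size(post_b)  # 8 + 8 = 16
assert combined < separate  # shared terms ("quick", "lazy") are indexed once
```

The more vocabulary the posts share, the bigger the saving from indexing them together.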
Why stop at two posts? Why not 100? You can, and of course should, concatenate more than two posts, but be careful: remember that once a query matches the concatenated text, you have actually wasted resources, as you now need to run it against each post (or do a binary search over sub-batches). You want to find the balance point where the chance of no match is much greater; that is the point at which your system is optimized.
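The whole flow can be sketched as follows. This is a toy model, not our production pipeline: `match_query` is a hypothetical stand-in for a Percolator call (here, an AND of positive terms), and `batch_size` is the knob you would tune to find that balance point:

```python
def match_query(query_terms, text):
    # Stand-in for percolation: every positive term must appear in the text.
    words = set(text.lower().split())
    return all(t in words for t in query_terms)


def percolate_batch(query_terms, posts, batch_size=2):
    """Run the query once per concatenated batch; only on a batch hit
    fall back to checking each post in that batch individually."""
    matches = []
    for i in range(0, len(posts), batch_size):
        batch = posts[i:i + batch_size]
        if match_query(query_terms, " ".join(batch)):  # one cheap check
            matches += [p for p in batch if match_query(query_terms, p)]
    return matches


posts = [
    "The quick brown fox jumps over the lazy dog",
    "This is a quick example since I'm lazy",
    "Nothing to see here",
    "Move along now",
]
result = percolate_batch(["quick", "fox"], posts)
print(result)  # only the first post contains both "quick" and "fox"
```

Note that the per-post fallback only runs for batches that matched; non-matching batches (the second one above) are dismissed with a single check.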
I mentioned earlier that you must remove the Boolean NOT clauses from the queries. If you don’t remove them, you might miss relevant posts. Take the query “quick -example” and run it against the concatenated text above: it won’t match, since the keyword “example” exists in the text, but it should have matched, because the first post on its own matches the query.
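That false negative is easy to demonstrate with a toy matcher (again, `match` is an illustrative stand-in for the Percolator, not Elasticsearch’s actual query parser):

```python
def match(text, must, must_not=()):
    # Toy Boolean query: all positive terms present, no negative term present.
    words = set(text.lower().split())
    return all(t in words for t in must) and not any(t in words for t in must_not)


post_a = "The quick brown fox jumps over the lazy dog"
post_b = "This is a quick example since I'm lazy"
combined = post_a + " " + post_b

# With the NOT clause, the batch check wrongly discards the whole batch:
assert match(post_a, ["quick"], ["example"]) is True
assert match(combined, ["quick"], ["example"]) is False

# Dropping the NOT clause for the pre-check keeps the batch alive; the
# clause is then re-applied when checking each post individually:
assert match(combined, ["quick"]) is True
```

Removing the NOT clause can only make the pre-check more permissive, so it never loses a real match; it just lets a few extra batches through to the per-post stage.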
That’s it. The solution takes more memory, as we now run two percolators (the pre-percolator and the actual alert percolator), but it’s 4 times faster. Hooray!