Preface
In my current job, we were upgrading a major Pimcore multi-site application with hundreds of pages and multiple languages. After the usual upgrade steps were completed, I needed to check whether anything had broken because of the updates to third-party dependencies. But with thousands of pages to look at, there was no easy way of doing it; it would take days just to visit each page and inspect the response. So we came up with the brilliant idea of first generating the sitemaps and then crawling them to find non-OK HTTP responses or specific keywords like "Error", "Exception", or fragments of stack traces like "/var/www/app/" in the response body. Of course this would not fix our problems, nor would it give us an excuse to skip quality assurance, but it would surely limit the scope of what we needed to inspect. Additionally, since we would actually be visiting these pages, any errors would show up in Sentry, which would make our work much easier.
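To give a rough idea of the check, here is a minimal sketch in Go (the language the tool eventually ended up in): fetch a page, verify the status code, and scan the body for error markers. The URL, the marker list, and the checkPage helper are purely illustrative, not Grawl's actual code.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// errorMarkers are the kinds of strings we grepped for in response bodies;
// the exact list here is illustrative.
var errorMarkers = []string{"Error", "Exception", "/var/www/app/"}

// checkPage fetches a URL and reports whether the response looks broken:
// a non-OK status code or a body containing one of the error markers.
func checkPage(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("non-OK status %d", resp.StatusCode)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("could not read body: %w", err)
	}
	for _, marker := range errorMarkers {
		if strings.Contains(string(body), marker) {
			return fmt.Errorf("body contains %q", marker)
		}
	}
	return nil
}

func main() {
	// Hypothetical URL, for illustration only.
	if err := checkPage("https://example.com/some-page"); err != nil {
		fmt.Println("suspicious page:", err)
	}
}
```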
Birth of GRAWL
Grawl is just a wordplay on a crawler written in golang, which, by the way, wasn't the first language I implemented the crawler in. I needed a quick script, so I wrote the entire crawler in PHP and ran it. However, PHP isn't really a language for this kind of task. In my first iteration of this project with PHP, it crawled about 1500 pages in almost 30 minutes (on a dev server), which was painful to wait for. It was of course better than manually going through each of those pages (which would certainly have taken days), but it wasn't good enough.
With PHP's blocking I/O, the script crawled one page at a time, and the crawling process took forever to complete. We needed the ability to run the crawler concurrently so that it could crawl multiple pages at a time. This is when I turned to golang.
I had not seriously tried golang before, but I knew a fair bit about the syntax and had heard good things about goroutines and how you could spawn millions of them with very little overhead. I also thought about writing it in Rust, but Rust would have made my life hell when dealing with async work like network calls as well as multi-threading. So I finally decided to rewrite my scrappy scraper in golang.
The first iteration
In the first iteration, I implemented the crawler in a single file. I did not use any concurrency features, just plain old loops and HTTP calls, and it was still faster than what I got with PHP. I forget the exact numbers, but it was faster. Once it was working, I adjusted the code to handle the crawling concurrently via goroutines. I was spawning thousands of goroutines, and seeing it work blew my mind. The entire site with over 2000 pages crawled in under a minute 🤯.
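For the curious, the concurrency pattern boils down to something like the sketch below: one goroutine per sitemap URL, a WaitGroup to wait for them all, and a buffered channel as a semaphore so the server isn't hit with thousands of requests at once. The URLs and the concurrency limit here are made up; Grawl's real code does more than this.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Hypothetical list of URLs pulled from the sitemap.
	urls := []string{
		"https://example.com/",
		"https://example.com/about",
		"https://example.com/contact",
	}

	var wg sync.WaitGroup
	// Buffered channel as a semaphore to cap how many requests
	// are in flight at the same time.
	sem := make(chan struct{}, 50)

	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()

			resp, err := http.Get(u)
			if err != nil {
				fmt.Println(u, "failed:", err)
				return
			}
			resp.Body.Close()
			fmt.Println(u, resp.StatusCode)
		}(url)
	}
	wg.Wait()
}
```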
Final Thoughts
After this huge success, I decided to clean everything up and create a proper CLI tool that I could open source. I used cobra to create the CLI tool, gave it a proper structure, and then open sourced it on GitHub.
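If you haven't used cobra before, a root command looks roughly like this. The command name matches the tool, but the flag and its wiring are hypothetical, not Grawl's actual interface.

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	// Hypothetical flag; the real tool's flags may differ.
	var sitemapURL string

	rootCmd := &cobra.Command{
		Use:   "grawl",
		Short: "Crawl a sitemap and report broken pages",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Println("crawling", sitemapURL)
			// ... crawl the sitemap concurrently here ...
			return nil
		},
	}
	rootCmd.Flags().StringVar(&sitemapURL, "sitemap", "", "URL of the sitemap to crawl")

	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```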
If you feel like you could benefit from this tool, give it a try and let me know your thoughts.
Until next time,