Hello there! This is the first post on crawl.blog! For some more info about the author, feel free to check out the about page, but to keep a long story short: I'm a guy who loves writing web-crawlers in Python and would love to share the things I know and the things I learn with the community!

What Does the Future Hold?

I already have a few ideas based on my Stack Overflow experience. The first few blog posts will focus on general crawling with Python, a few later ones on reverse engineering web pages, and after that some more advanced subjects like JavaScript execution, avoiding bans, crawl distribution and speed, and AI-based crawling.

I'm really excited to brush up my knowledge for these articles, as some of the more exciting cases aren't encountered very often. I also hope this blog will push me to learn the parts of web crawling I don't enjoy, and to explore alternative frameworks and approaches to my current crawling stack.

My Favorites

My current favorite web crawling projects all originate from Scrapinghub, and despite no longer working there I still contribute to them (quick sketches of each follow the list):

  1. scrapy - the Django of the web-crawling community. It's a bit old and can get a bit messy sometimes, but it's feature-rich, well designed and easy to extend. It is still my favorite crawling framework and I don't think I'll be retiring it any time soon!

  2. parsel - lxml is the de facto standard way to parse HTML in Python (or pretty much any other language, really), but its Python API is not very user friendly. Parsel is a beautiful wrapper that makes lxml much more pleasant to work with. Think of it as requests compared to urllib!

    fun fact: I came up with the name for this one :)

  3. splash - a JavaScript rendering service. While I'm a firm believer that any website on the web can be reverse engineered and its JavaScript ported to native Python or run in a sub-process, Splash still shows up in my stack often as the perfect lazy solution!

  4. treq - while Python's asynchronous story has been getting better and better, I still often feel that callback-driven async programming is more natural than coroutines. It might just be my upbringing with scrapy's Twisted, though.

  5. click - a beautiful CLI framework. Crawlers, just like any other processes, need some sort of front-end for easy management, and click works surprisingly well! My favorite features are the progress bars, which make dumping and post-processing data a pure pleasure:

     $ crawler dump from-db to-json alexa --limit 1000
     dumping alexa.com  [β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘]  505/1000   50%  0d 00:00:44
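
To make the list above a bit more concrete, here are a few quick sketches, one per project. First, a minimal scrapy spider - it targets the quotes.toscrape.com demo site and a made-up item shape, so treat it as an illustration rather than code from a real project:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        """Minimal spider: crawl the demo site and yield quote items."""
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # response.css() is powered by parsel (see the next sketch)
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

Saved as quotes_spider.py (any file name works), it can be run with scrapy runspider quotes_spider.py -o quotes.json.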
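
Next, parsel with a hard-coded HTML snippet, just to show how CSS and XPath share one fluent API with none of the raw lxml boilerplate:

    from parsel import Selector

    html = "<div><p class='intro'>Hello <a href='/about'>about</a></p></div>"
    sel = Selector(text=html)

    # CSS and XPath over the same document, no lxml boilerplate
    print(sel.css("p.intro::text").get())   # 'Hello '
    print(sel.xpath("//a/@href").get())     # '/about'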
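
Splash is driven over plain HTTP. This sketch assumes a local instance (e.g. started from the scrapinghub/splash Docker image on the default port 8050) and uses its render.html endpoint:

    import requests

    # ask the local Splash instance to render the page, executing its JavaScript
    resp = requests.get(
        "http://localhost:8050/render.html",
        params={"url": "https://example.com", "wait": 1},
    )
    html = resp.text  # HTML after JavaScript execution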
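
For treq, a small sketch of the callback style I mentioned - chaining callbacks on a Deferred instead of awaiting coroutines:

    import treq
    from twisted.internet.task import react

    def main(reactor):
        # treq.get() returns a Deferred; we chain callbacks instead of awaiting
        d = treq.get("https://example.com")
        d.addCallback(treq.text_content)
        d.addCallback(lambda body: print(f"fetched {len(body)} characters"))
        return d

    react(main)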
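
And finally click: a toy command in the spirit of the dump example above. The command name, its options and the fake record source are all made up for illustration:

    import click

    @click.command()
    @click.argument("site")
    @click.option("--limit", default=1000, help="maximum number of records to dump")
    def dump(site, limit):
        """Dump crawled records with a click progress bar."""
        records = range(limit)  # stand-in for a real database cursor
        with click.progressbar(records, label=f"dumping {site}") as bar:
            for record in bar:
                pass  # write each record out here

    if __name__ == "__main__":
        dump()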
    

Indirect Subjects

I'd also love to cover web-crawling as a political subject and try to dispel the stigma, pushed by big corporations, that crawling is evil.

There's a huge logical fallacy being enforced: that data on the web is public, but only when we want it to be. Some groups want to have their cake and eat it too - have their data public and crawled by search engines, but not by anyone else.

One of the most famous cases is LinkedIn - a company that built its whole business on unethically crawling everything from public data to private phone contacts, yet is notoriously hostile towards anyone who crawls LinkedIn itself.

LinkedIn Takes Data Scraping Fight to Ninth Circuit

Contributors

If anyone would like to contribute, check out the GitLab repo of this blog or contact me via email - all of the details are in the website's footer.

Thanks for reading and expect more to come!