Hello there! This is the first post on crawl.blog! For more info about the author, feel free to check out the about page, but to make a long story short - I'm a guy who loves writing web crawlers in Python and would love to share the things I know and the things I learn with the community!
What Does the Future Hold?
I'm really excited to brush up my knowledge for these articles, as some of the more exciting cases aren't encountered very often. I also hope this blog will push me to tackle the parts of web crawling I don't enjoy and to learn alternative frameworks and approaches to my current crawling stack.
My current favorite web-crawling projects originate from Scrapinghub, and despite no longer working there I still contribute to:
scrapy - the Django of the web-crawling community. It's a bit old and can get a bit messy at times, but it's feature-rich, well designed, and easy to extend. It's still my favorite crawling framework and I don't think I'll be retiring it any time soon!
parsel - lxml is the de facto standard way to parse HTML in Python (or pretty much anything else, really), but its Python API is not very user friendly. Parsel is a beautiful wrapper that makes lxml much more pleasant to work with. Fun fact: I came up with the name for this one :)
treq - while Python as an asynchronous language has been getting better and better, I still often feel that callback-driven async programming is more natural than coroutines. That might just be my upbringing with scrapy's Twisted, though.
click - a beautiful CLI framework. Crawlers, like any long-running processes, need some sort of front-end for easy management, and click works surprisingly well! My favorite features are the progress bars, which make dumping and post-processing data a pure pleasure:
$ crawler dump from-db to-json alexa --limit 1000
dumping alexa.com [▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░] 505/1000 50% 0d 00:00:44
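A hypothetical click command along those lines - the `dump` command, its argument, and its option are made up for illustration, not the actual crawler above:

```python
import click


@click.command()
@click.argument("source")
@click.option("--limit", default=1000, help="Maximum number of records to dump.")
def dump(source, limit):
    """Dump records from SOURCE, showing a progress bar as we go."""
    records = range(limit)  # stand-in for rows pulled from a database
    with click.progressbar(records, label=f"dumping {source}") as bar:
        for record in bar:
            pass  # post-process / serialize each record here
```

click builds the `--help` text, argument parsing, and the progress bar for free; hook the function up as a console script and you get a front-end like the one shown above.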
I'd also love to cover web crawling as a political subject and try to dispel the stigma, pushed by big corporations, that crawling is evil.
There's a huge logical fallacy being enforced: that data on the web is public, but only when we want it to be. Some groups want to have their cake and eat it too - to have their data public and crawled by search engines, but not by anyone else.
One of the most famous cases is the cancerous case of LinkedIn - a company that built its whole business on unethically crawling everything from public data to private phone contacts, yet is notoriously hostile towards anyone who crawls LinkedIn itself.
If anyone would like to contribute, check out this blog's GitLab repo or contact me via email - all of the details are in the website's footer.
Thanks for reading and expect more to come!