Building html parsers can be quite time consuming and often complicated. parsel is a python library for parsing html with css or xpath selectors, and parselcli is an interactive shell wrapper around it.
Disclaimer: I wrote parselcli
parselcli allows you to do this:
At the moment (version 0.31) parselcli has these features:
```
$ parsel "http://httpbin.org/html" -c h1
['<h1>Herman Melville - Moby-Dick</h1>']
```
Supports output processors:
```
$ parsel "http://httpbin.org/html" -c h1::text -p join,len
27  # the h1 text is 27 characters long!
```
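The processor chain is essentially function composition over the selector results; roughly like this (my approximation, not parselcli's actual code):

```python
# Rough sketch of what the join,len processor chain does:
results = ['Herman Melville - Moby-Dick']  # what h1::text returned
joined = ''.join(results)  # join: collapse the result list into one string
length = len(joined)       # len: report its length
print(length)              # 27
```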
Supports commands for a fast workflow and debugging:
```
$ parsel "http://httpbin.org/html"
> -help
available commands (use -command):
  help: show help
  debug: show debug info
  embed: start interactive python shell
  open: open current url in browser tab
  view: open current html in browser tab
  fetch: download from new url
  css: switch to css selectors
  xpath: switch to xpath selectors
```
Fully configurable via the ~/.config/parsel.toml file (the exact location depends on your system):
```
$ parsel "http://httpbin.org/html" --embed
>>> dir()
['request', 'response', 'sel']
>>> response.cookies
<RequestsCookieJar>
```
I wrote parselcli to have a faster and more convenient workflow for building html parsers.
The first thing I do is find a product/item that has the highest coverage. For example, if I'm crawling a clothing shop I look for a product that has all of the fields: variants, colours, sizes and multiple prices. Shoes often meet these criteria.
This will be my genesis html that I will use to build my parser.
I pass this url to parsel and cache it in case I need to run it again in the future.
```
$ parsel "http://httpbin.org/html" --cache
using cached version
>
```
Then I use the -view command to open up the downloaded source in my browser, or the -open command to open up the live url in my browser.
In my browsers (chromium or qutebrowser) I open up the inspector, click around the html code, and identify css selectors where possible, or xpath if something more complicated is required.
I pop my ideas into the parsel shell and see how they look.
```
> h1::text
['Herman Melville - Moby-Dick']
```
If I'm satisfied I save the selectors in my crawler code and move on.
Quite simple really!
Parselcli is still in rather early development, but I've been using it in production for a while now. Nevertheless, any contributions are welcome!