parselcli - css and xpath selectors in your terminal!

Building an HTML parser can be time-consuming and often complicated. parsel is a Python library for parsing HTML with CSS or XPath selectors, and parselcli is an interactive shell wrapper around it.

Disclaimer: I wrote parselcli

In short parselcli allows you to do this:

As of version 0.31, parselcli has these features:

  • css and xpath selectors
  • Auto-completion for classes, ids and tags, with history (for the ptpython shell)
  • Evaluating against cached, live urls or local files (see --cached and --file flags)
  • Inline executions:

    $ parsel "" -c h1
    ['<h1>Herman Melville - Moby-Dick</h1>']
  • Supports output processors:

    $ parsel "" -c h1::text -p join,len
    # the h1 text is 27 characters long!
  • Supports commands for a fast workflow and debugging:

    $ parsel ""
    > -help
    available commands (use -command):
      help: show help
      debug: show debug info
      embed: start interactive python shell
      open: open current url in browser tab
      view: open current html in browser tab
      fetch: download from new url
      css: switch to css selectors
      xpath: switch to xpath selectors
  • Fully configurable via the ~/.config/parsel.toml file (the location depends on your $XDG_CONFIG_HOME)

  • Preloaded embed shell for full Python access:
    $ parsel "" --embed
    >>> dir()
    ['request', 'response', 'sel']
    >>> response.cookies
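
One way to think about the output processors from the `-p join,len` example is as a chain of functions applied left to right to the list of selected values. The sketch below is only an illustration of that idea, not parselcli's actual implementation: the `PROCESSORS` table and `apply_processors` helper are hypothetical names, and joining with an empty separator is an assumption.

```python
from functools import reduce

# Hypothetical processor table. The names mirror the `-p join,len` example,
# but this is an illustrative model, not parselcli's own code.
PROCESSORS = {
    "join": lambda values: "".join(values),  # assumed: empty separator
    "len": len,
}


def apply_processors(values, spec):
    """Apply comma-separated processors left to right, e.g. "join,len"."""
    names = [name.strip() for name in spec.split(",") if name.strip()]
    return reduce(lambda acc, name: PROCESSORS[name](acc), names, values)


# The list that `-c h1::text` selects in the examples above.
selected = ["Herman Melville - Moby-Dick"]
print(apply_processors(selected, "join,len"))  # 27
```

Here `join` collapses the list into one string and `len` then counts its characters, which is where the 27 comes from.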

My Workflow

I built parselcli to have a faster and more convenient workflow for building HTML parsers.

The first thing I do is find a product/item that has the highest field coverage. For example, if I'm crawling a clothing shop, I look for a product that has all of the fields, such as variants, colours, sizes and multiple prices. Shoes often meet these criteria.
This will be my genesis HTML that I use to build my parser.

I pass this URL to parsel and cache it in case I need to run it again in the future.

$ parsel "" --cache
using cached version
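
The caching step boils down to "download once, reuse the saved HTML afterwards". Below is a minimal sketch of that idea using only the standard library. It is not how parselcli stores its cache: the `get_html` and `fetch_live` helpers, the hashed-filename layout and the `https://example.com/product` URL are all assumptions made for illustration.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def fetch_live(url):
    """Download a page body as text (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def get_html(url, cache_dir, fetch=fetch_live):
    """Return (html, was_cached), storing new downloads under cache_dir."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One file per URL, named by a hash so any URL maps to a safe filename.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8"), True
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html, False


# Demo with a stand-in fetcher so the sketch runs offline.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<h1>Herman Melville - Moby-Dick</h1>"


with tempfile.TemporaryDirectory() as tmp:
    first = get_html("https://example.com/product", tmp, fetch=fake_fetch)
    second = get_html("https://example.com/product", tmp, fetch=fake_fetch)

print(first[1], second[1], len(calls))  # False True 1
```

The second call is served from disk, so the page is only downloaded once no matter how many times the selectors are re-run against it.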

Afterwards, if the website functions without JavaScript, I use the -view command to open the current HTML in my browser, or the -open command to open the live URL.

In my browser (chromium or qutebrowser) I open the inspector, click around the HTML and identify my selectors: CSS where possible, XPath when something more complicated is required.
I pop my ideas into the parsel shell and see how they look.

> h1::text
['Herman Melville - Moby-Dick']

If I'm satisfied I save the selectors in my crawler code and move on.
Quite simple really!

Contribution is Welcome!

Parselcli is still in rather early development, though I've been using it in production for a while now. Any contributions are welcome!