phantom.py: A lean replacement for bulky headless browser frameworks

phantom.py is a simple but fully scriptable headless QtWebKit browser written in Python 3 with PyQt5. It specializes in executing external JavaScript and generating PDF files, and is meant as a lean replacement for bulkier headless browser frameworks. (The source code is at the end of this post as well as in this GitHub gist.)

Usage

If you have a display attached:

If you don’t have a display attached (e.g. on a remote server):

Arguments:

  • <url> An http(s) URL or a path to a local file
  • <pdf-file> Path and name of PDF file to generate
  • [<javascript-file>] (optional) Path and name of a JavaScript file to execute
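
As a rough sketch of how these arguments fit together, the helper below assembles an invocation from Python. It assumes the script is executable as ./phantom.py, takes its arguments in the order listed above, and is wrapped in xvfb-run when no display is attached. The helper name build_command and its parameters are hypothetical and not part of phantom.py.

```python
import shlex
from typing import List, Optional


def build_command(url: str,
                  pdf_file: str,
                  javascript_file: Optional[str] = None,
                  has_display: bool = True,
                  script: str = "./phantom.py") -> List[str]:
    """Assemble an argv list for phantom.py (hypothetical helper).

    Assumption: arguments are positional, in the order
    <url> <pdf-file> [<javascript-file>].
    """
    argv = [script, url, pdf_file]
    if javascript_file is not None:
        # The JavaScript file is optional and comes last.
        argv.append(javascript_file)
    if not has_display:
        # Assumption: xvfb-run supplies a virtual display on headless machines.
        argv = ["xvfb-run"] + argv
    return argv


if __name__ == "__main__":
    cmd = build_command("/tmp/test.html", "/tmp/out.pdf", "/tmp/test.js",
                        has_display=False)
    # Print a copy-pasteable shell line; pass cmd to subprocess.run() to execute it.
    print(shlex.join(cmd))
```

Building a list rather than a single string keeps paths with spaces intact if the result is later handed to subprocess.run().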

Features

  • Generate a PDF screenshot of the web page after it is completely loaded.
  • Optionally execute a local JavaScript file specified by the argument <javascript-file> after the web page is completely loaded, and before the PDF is generated.
  • Output from console.log() is printed to stdout.
  • Easily add new features by changing the source code of this script, without compiling C++ code. For more advanced applications, consider attaching PyQt objects/methods to WebKit’s JavaScript space by using QWebFrame::addToJavaScriptWindowObject().

If you execute an external <javascript-file>, phantom.py has no way of knowing when that script has finished doing its work. For this reason, the external script should execute console.log("__PHANTOM_PY_DONE__"); when it is done. This triggers the PDF generation, after which phantom.py exits. If no __PHANTOM_PY_DONE__ string is seen on the console for 10 seconds, phantom.py exits without generating a PDF. This behavior could be implemented more elegantly without console.log, but it is the simplest solution.
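
The marker-and-timeout logic can be illustrated without any Qt code. The following is only a sketch of the idea, not the actual phantom.py implementation: the class name DoneWatcher, the injected clock, the substring match on the marker, and counting the 10 seconds from the moment the watcher is created are all assumptions. In the real script, console messages would arrive through the WebKit page object and the timeout would be driven by a timer.

```python
import time
from typing import Callable

DONE_MARKER = "__PHANTOM_PY_DONE__"
TIMEOUT_SECONDS = 10.0


class DoneWatcher:
    """Watches console output for the done marker (illustrative only)."""

    def __init__(self,
                 on_done: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic,
                 timeout: float = TIMEOUT_SECONDS) -> None:
        self._on_done = on_done
        self._clock = clock
        # Assumption: the timeout counts from creation of the watcher.
        self._deadline = clock() + timeout
        self.finished = False

    def console_message(self, message: str) -> None:
        """Call this for every console.log message from the page."""
        print(message)  # console output is echoed to stdout
        # Assumption: a substring match is enough to detect the marker.
        if not self.finished and DONE_MARKER in message:
            self.finished = True
            self._on_done()  # e.g. generate the PDF, then exit

    def timed_out(self) -> bool:
        """True if the marker never arrived before the deadline."""
        return not self.finished and self._clock() >= self._deadline


if __name__ == "__main__":
    watcher = DoneWatcher(on_done=lambda: print("would generate the PDF now"))
    watcher.console_message("working...")
    watcher.console_message(DONE_MARKER)
```

Passing the clock in as a parameter makes the timeout path easy to exercise without actually waiting 10 seconds.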

Since you’re just running WebKit, you can use everything that WebKit supports, including the usual JS client libraries, CSS, CSS @media types, etc.

Dependencies

  • Python3
  • PyQt5
  • xvfb (optional, only needed on machines without a display)
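
To see which of these are already present on a machine, a small check along the following lines may help. It is a sketch under assumptions: the function names are hypothetical, the QtWebKit bindings are assumed to be importable as PyQt5.QtWebKitWidgets, and xvfb is assumed to provide the xvfb-run command.

```python
import importlib.util
import shutil
import sys
from typing import Dict


def has_module(name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (e.g. PyQt5 itself) is missing.
        return False


def check_dependencies() -> Dict[str, bool]:
    """Report which dependencies appear to be available (hypothetical helper)."""
    return {
        "python3": sys.version_info.major == 3,
        "PyQt5": has_module("PyQt5"),
        # Assumption: the WebKit widgets live in PyQt5.QtWebKitWidgets.
        "PyQt5.QtWebKitWidgets": has_module("PyQt5.QtWebKitWidgets"),
        # Optional: only needed on machines without a display.
        "xvfb-run": shutil.which("xvfb-run") is not None,
    }


if __name__ == "__main__":
    for name, present in check_dependencies().items():
        print(f"{name}: {'found' if present else 'missing'}")
```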

Installing the dependencies on Debian Stretch is easy:

Finding the equivalent for other OSes is an exercise that I leave to you.

Examples

Given the following file /tmp/test.html:

… and the following file /tmp/test.js:

… and running this script (without attached display) …

… you will get a PDF file /tmp/out.pdf with the contents “foo bar baz”.

Note that the second occurrence of “foo” has been replaced by the web page’s own script, and the third occurrence of “foo” by the external JS file.

Source Code

 
