Web Scraping with NodeJS and PhantomJS

February 7, 2018
What is Web Scraping?

For those of you who haven't heard of web scraping before, it is pulling data straight out of raw HTML, as opposed to an API, where the data is ready for you to take. Getting phone numbers from sales websites or the first names of your Facebook friend list are good examples. There are a lot of tasks on Upwork.com related to web scraping, similar to:

«Get data from Airbnb. All the info about houses & apartments in London. Price, characteristics, pictures, etc...»

Clearly, it is a good skill to have under your belt. All you need to do is pick up the necessary tools.

Why PhantomJS?

We can split web scraping into two major steps, and there are a variety of tools for each.

  • Getting HTML (simple HTTP request, Selenium, PhantomJS, etc.)
  • Parsing HTML (jQuery, cheerio, Nokogiri, etc.)

It is crucial to choose an appropriate tool for the first step, whilst the tool for the second is more of a personal choice.
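To make the split concrete, here is a minimal sketch of both steps together, using a plain HTTP request for the first and cheerio for the second (a hypothetical example: example.com stands in for a real target, and it assumes cheerio is installed).

// two-steps.js, a sketch: step 1 fetches raw HTML, step 2 parses it
const https = require("https");
const cheerio = require("cheerio");

https.get("https://example.com/", function(res) {
  let html = "";

  // step 1: collect the HTML sent by the server
  res.on("data", function(chunk) {
    html += chunk;
  });

  res.on("end", function() {
    // step 2: parse it and pull out the data we need
    const $ = cheerio.load(html);
    console.log($("h1").text());
  });
});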

PhantomJS is a browser without a graphical interface (a headless browser). It can imitate user actions: mouse clicks, form submissions, etc. But for web scraping its major feature is the ability to execute client-side JavaScript. This gives us the opportunity to get the HTML after all jQuery plugins and front-end frameworks (React, Angular) have initialized, i.e. the HTML that users actually see.
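As a quick taste of the user-action side (a sketch only; we won't need it in this article), PhantomJS can dispatch a synthetic mouse click at given viewport coordinates:

// a sketch: imitating a mouse click in PhantomJS
var page = require("webpage").create();

page.open("https://example.com/", function(status) {
  // click at viewport coordinates (100, 200)
  page.sendEvent("click", 100, 200);
  phantom.exit();
});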

This ability to execute client-side code is the main difference between PhantomJS and a simple HTTP request, which just returns the HTML sent by the server. We can compare the HTML content of Parabola.io as received by both methods.

<!-- simple http request -->
<html>

<head>
  <meta charset="utf-8">
  <title>Parabola</title>
  ...
</head>

<body>
  <div id="main-view"></div>
  <script type="text/javascript" src="/static/app/js/main.a03ec4deadf36fbf17db.js"></script>
</body>

</html>
<!-- PhantomJS -->
<html class="wf-inactive wf-proximanova-n6-inactive wf-proximanova-n4-inactive wf-proximanova-n5-inactive wf-adelle-n4-inactive">

<head>
  <meta charset="utf-8">
  <title>Parabola says...</title>
  ...
</head>

<body>
  <div id="main-view">
    <div data-reactroot="" data-radium="true">
      <div data-radium="true">
        ...
      </div>
    </div>
  </div>
  <script type="text/javascript" src="/static/app/js/main.a03ec4deadf36fbf17db.js"></script>
  <iframe name="stripeXDM_default815431_provider" id="stripeXDM_default815431_provider" src="https://js.stripe.com/v2/channel.html?stripe_xdm_e=https%3A%2F%2Fparabola.io&amp;stripe_xdm_c=default815431&amp;stripe_xdm_p=1#__stripe_transport__"
    frameborder="0" id="wistia-57p3clyq0q-1_popover_popover_close_button" class="wistia_placebo_close_button">
    ...
  </iframe>
</body>

</html>

I think the point is clear. Now we can scrape websites that heavily rely on front-end frameworks like React, Angular, or Ember, as well as typical websites.

Where do we start?

The goal of this article is to show a minimalistic example of web scraping with PhantomJS. I am not going to cover every detail, such as submitting forms or clicking links, because this is the first article in an upcoming series, and because I don't think we should do that manually with PhantomJS anyway. There are a lot of libraries built on top of PhantomJS that relieve us of this tedious work (Nightwatch, CasperJS). The next articles are going to focus on them.
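For a taste of how much these libraries shorten things, here is the same "open a page and print its title" task sketched in CasperJS (run with the casperjs binary, not node; treat the details as approximate until the dedicated article):

// casper-example.js, a CasperJS sketch
var casper = require("casper").create();

casper.start("https://parabola.io/", function() {
  this.echo(this.getTitle());
});

casper.run();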

First of all, we need to make sure NodeJS is installed, and then initialize the project.

# check NodeJS version, 7.1.0 in my case
node -v

# initialize the project
npm init --yes

# install the necessary libs
npm i phantomjs-prebuilt cheerio --save

PhantomJS is not a library for NodeJS but a separate runtime environment, so code written for PhantomJS can be incompatible with NodeJS. The package we installed (phantomjs-prebuilt) is just an npm wrapper that downloads PhantomJS, unzips it, and makes it available to work with from NodeJS.
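A quick sanity check that the wrapper did its job is to print the path it exposes (the same path property we use in index.js below; the exact location on disk will vary):

// print the location of the downloaded PhantomJS binary
const phantomjs = require("phantomjs-prebuilt");
console.log(phantomjs.path);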

The major part of the project consists of two files: scrapers/index.js and scrapers/phantom-script.js. The whole project directory looks like this.

scrape_with_phantomjs
  |-- main.js
  |-- scrapers
  |   |-- index.js
  |   +-- phantom-script.js
  |-- package.json
  +-- node_modules

The file index.js is a middleware between PhantomJS and our project. It runs PhantomJS in a separate process, gives it the path to the phantom-script.js file, and receives the data.

// scrapers/index.js

const path = require("path");
const childProcess = require("child_process");

// path to PhantomJS bin
const phantomJsPath = require("phantomjs-prebuilt").path;

exports.fetch = function(url, reject, resolve) {
  // execute phantom-script.js file via PhantomJS
  const childArgs = [path.join(__dirname, "phantom-script.js")];
  const phantom = childProcess.execFile(phantomJsPath, childArgs, {
    // pass the target URL to the child via an environment variable
    env: {
      URL: url
    },
    // raise the stdout limit, rendered pages easily exceed the default
    maxBuffer: 2048 * 1024
  });

  let stdout = "";
  let stderr = "";

  // data comes gradually, bit by bit
  phantom.stdout.on("data", function(chunk) {
    stdout += chunk;
  });

  phantom.stderr.on("data", function(chunk) {
    stderr += chunk;
  });

  phantom.on("uncaughtException", function(err) {
    console.log("uncaught exception: " + err);
  });

  phantom.on("exit", function(exitCode) {
    if (exitCode !== 0) {
      return reject(stderr);
    }

    resolve(stdout);
  });
};
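A side note on the signature: fetch takes the reject callback before resolve, mirroring the order in which errors are handled in main.js below. If you prefer promises, a thin wrapper (hypothetical, not one of the project files) restores the conventional order:

// a hypothetical promise wrapper around fetch
const { fetch } = require("./scrapers/index.js");

function fetchPage(url) {
  return new Promise(function(resolve, reject) {
    fetch(url, reject, resolve);
  });
}

// usage: fetchPage("https://parabola.io/").then(html => console.log(html.length));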

The file phantom-script.js is the one executed by PhantomJS itself. All browser-specific actions are written here.

// scrapers/phantom-script.js

var system = require("system");
var env = system.env;
var page = require("webpage").create();

page.settings.userAgent =
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36";

// default viewport size is small, change it to 1366x768
page.viewportSize = {
  width: 1366,
  height: 768
};

// open page
page.open(env.URL, function(status) {
  if (status == "success") {
    // wait until all the assets are loaded
    function checkReadyState() {
      var readyState = page.evaluate(function() {
        return document.readyState;
      });

      if (readyState == "complete") {
        var result = page.evaluate(function() {
          return document.documentElement.outerHTML;
        });

        // exit and return HTML
        system.stdout.write(result);
        phantom.exit(0);
      } else {
        setTimeout(checkReadyState, 50);
      }
    }

    checkReadyState();
  } else {
    // if status is not 'success' exit with an error
    system.stderr.write("failed to load page: " + env.URL);
    phantom.exit(1);
  }
});
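One caveat: readyState becoming "complete" only means the initial document and its assets have loaded; a React or Angular app may render its markup afterwards. A common workaround is to poll for a selector that only exists after rendering. Here is a sketch of that idea, meant to replace checkReadyState inside the success branch; "#main-view div" matches the rendered Parabola markup shown earlier, and the 50 ms interval is arbitrary.

// a sketch: poll until a selector rendered by the front end appears
function waitForSelector(selector, onReady) {
  var found = page.evaluate(function(sel) {
    return document.querySelector(sel) !== null;
  }, selector);

  if (found) {
    onReady();
  } else {
    setTimeout(function() {
      waitForSelector(selector, onReady);
    }, 50);
  }
}

waitForSelector("#main-view div", function() {
  // page.content holds the full rendered HTML
  system.stdout.write(page.content);
  phantom.exit(0);
});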

The file main.js is the entry point. We pass a URL and decide what we want to do with the received HTML; in this case, we just print the page title to the console.

// main.js

const cheerio = require("cheerio");
const { fetch } = require("./scrapers/index.js");

const URL = "https://parabola.io/";

fetch(
  URL,
  error => {
    console.log(error);
  },
  html => {
    const $ = cheerio.load(html);
    const title = $("title").text();

    console.log(title);
  }
);
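Once the HTML is loaded into cheerio, pulling out more than the title is just a matter of selectors. For example, the success callback could collect every link on the page (a sketch; the useful selectors depend entirely on the target page):

html => {
  const $ = cheerio.load(html);

  // collect the text and href of every link on the page
  const links = [];
  $("a").each((i, el) => {
    links.push({ text: $(el).text().trim(), href: $(el).attr("href") });
  });

  console.log(links);
}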

Finally, all you need to do is run main.js.

node main.js

And it's done! The application code is pretty short. I hope this gives you a rough idea of how to get started with web scraping using PhantomJS.

If you liked this article and think others should read it, please share it on Twitter.