For those of you who haven't heard of web scraping before: it is the practice of pulling data straight out of raw HTML, as opposed to an API, where the data arrives ready for you to take. Getting phone numbers from sales websites or the first names of your friend list on Facebook are good examples. There are a lot of web scraping tasks on Upwork.com similar to:
«Get data from Airbnb. All the info about houses & apartments in London. Price, characteristics, pictures, etc...»
Apparently, it is a good skill to have under your belt. All you need to do is pick up the necessary tools.
We can split web scraping into two major steps: fetching the HTML and extracting data from it. There is a variety of tools for both.
It is crucial to choose an appropriate tool for the first step, whilst the second is more a matter of personal preference.
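To make the split concrete, here is a minimal sketch in plain Node.js. Step one is stubbed out with a hardcoded HTML string (in a real scraper it would come from an HTTP request or a headless browser), and step two pulls the title out with a regular expression instead of a proper parser. The function names are mine, purely for illustration.

```javascript
// Step 1: fetch the raw HTML. Stubbed here with a hardcoded string;
// in a real scraper this would come from an HTTP request or a headless browser.
function fetchHtml() {
  return "<html><head><title>Parabola</title></head><body></body></html>";
}

// Step 2: extract data from the HTML. A crude regex stands in for a
// proper parser such as cheerio.
function extractTitle(html) {
  var match = html.match(/<title>([^<]*)<\/title>/);
  return match ? match[1] : null;
}

console.log(extractTitle(fetchHtml())); // prints "Parabola"
```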
PhantomJS is a browser without a graphical interface (a headless browser). It can imitate user actions: mouse clicks, form submissions, etc. But for web scraping its major feature is the ability to execute client-side JavaScript. This gives us the opportunity to get the HTML after all the jQuery plugins & front-end frameworks (React, Angular) have initialized, i.e. the HTML that users actually see.
This is the main difference between PhantomJS and a simple HTTP request, which just returns the HTML as sent by the server. We can compare the HTML content of Parabola.io retrieved by both methods.
<!-- simple http request -->
<html>
  <head>
    <meta charset="utf-8">
    <title>Parabola</title>
    ...
  </head>
  <body>
    <div id="main-view"></div>
    <script type="text/javascript" src="/static/app/js/main.a03ec4deadf36fbf17db.js"></script>
  </body>
</html>
<!-- PhantomJS -->
<html class="wf-inactive wf-proximanova-n6-inactive wf-proximanova-n4-inactive wf-proximanova-n5-inactive wf-adelle-n4-inactive">
  <head>
    <meta charset="utf-8">
    <title>Parabola says...</title>
    ...
  </head>
  <body>
    <div id="main-view">
      <div data-reactroot="" data-radium="true">
        <div data-radium="true">
          ...
        </div>
      </div>
    </div>
    <script type="text/javascript" src="/static/app/js/main.a03ec4deadf36fbf17db.js"></script>
    <iframe name="stripeXDM_default815431_provider" id="stripeXDM_default815431_provider" src="https://js.stripe.com/v2/channel.html?stripe_xdm_e=https%3A%2F%2Fparabola.io&stripe_xdm_c=default815431&stripe_xdm_p=1#__stripe_transport__"
      frameborder="0" id="wistia-57p3clyq0q-1_popover_popover_close_button" class="wistia_placebo_close_button">
      ...
    </iframe>
  </body>
</html>
I think the point is clear. Now we can scrape websites that heavily rely on front-end frameworks like React, Angular or Ember, as well as conventional websites.
The goal of this article is to show a minimalistic example of web scraping with PhantomJS. I am not going to cover all the details, such as submitting forms or clicking links, because this is the first article in an upcoming series and because I don't think we should do that manually with PhantomJS. There are plenty of libraries built on top of PhantomJS that relieve us from this tedious work (Nightwatch, CasperJS). The next articles will focus on them.
First of all, we need to make sure NodeJS is installed, and then initialize the project.
# check NodeJS version, 7.1.0 in my case
node -v
# initialize the project
npm init --yes
# install the necessary libs
npm i phantomjs-prebuilt cheerio --save
PhantomJS is not a NodeJS library but a separate runtime environment, so code written for PhantomJS can be incompatible with NodeJS. The package we installed (phantomjs-prebuilt) is just an npm wrapper that downloads PhantomJS, unzips it & makes the binary available to NodeJS.
The major part of the «project» consists of two files: scrapers/index.js & scrapers/phantom-script.js. The whole project directory looks like this.
scrape_with_phantomjs
|-- main.js
|-- scrapers
|   |-- index.js
|   +-- phantom-script.js
|-- package.json
+-- node_modules
The file index.js acts as a middleware between PhantomJS & our project: it runs PhantomJS in a separate process, passes it the path to the phantom-script.js file & collects the resulting data.
// scrapers/index.js
const path = require("path");
const childProcess = require("child_process");
// path to the PhantomJS binary
const phantomJsPath = require("phantomjs-prebuilt").path;

exports.fetch = function(url, reject, resolve) {
  // execute the phantom-script.js file via PhantomJS
  const childArgs = [path.join(__dirname, "phantom-script.js")];
  const phantom = childProcess.execFile(phantomJsPath, childArgs, {
    env: {
      URL: url
    },
    maxBuffer: 2048 * 1024
  });
  let stdout = "";
  let stderr = "";
  // data comes gradually, chunk by chunk
  phantom.stdout.on("data", function(chunk) {
    stdout += chunk;
  });
  phantom.stderr.on("data", function(chunk) {
    stderr += chunk;
  });
  // "error" fires if the process could not be spawned or killed
  phantom.on("error", function(err) {
    console.log("process error: " + err);
  });
  phantom.on("exit", function(exitCode) {
    if (exitCode !== 0) {
      return reject(stderr);
    }
    resolve(stdout);
  });
};
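Since fetch takes its error callback before its success callback, it maps cleanly onto a Promise if you prefer async/await. A minimal sketch (fetchPage and the stub are my own hypothetical names; in the real project you would pass the fetch exported by scrapers/index.js):

```javascript
// A hypothetical Promise wrapper around the callback-style fetch above.
function fetchPage(fetch, url) {
  return new Promise(function(resolve, reject) {
    fetch(url, reject, resolve);
  });
}

// Usage sketch with a stub standing in for the real scrapers/index.js fetch:
var stubFetch = function(url, reject, resolve) {
  resolve("<html><title>" + url + "</title></html>");
};

fetchPage(stubFetch, "https://parabola.io/").then(function(html) {
  console.log(html);
});
```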
The file phantom-script.js is the one actually executed by PhantomJS; all browser-specific actions are written here.
// scrapers/phantom-script.js
var system = require("system");
var env = system.env;
var page = require("webpage").create();

page.settings.userAgent =
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36";

// the default viewport is small, change it to 1366x768
page.viewportSize = {
  width: 1366,
  height: 768
};

// open the page
page.open(env.URL, function(status) {
  if (status === "success") {
    // wait until all the assets are loaded
    function checkReadyState() {
      var readyState = page.evaluate(function() {
        return document.readyState;
      });
      if (readyState === "complete") {
        var result = page.evaluate(function() {
          return document.documentElement.outerHTML;
        });
        // write the HTML to stdout and exit
        system.stdout.write(result);
        phantom.exit(0);
      } else {
        setTimeout(checkReadyState, 50);
      }
    }
    checkReadyState();
  } else {
    // if the status is not 'success', report it and exit with an error
    system.stderr.write("Failed to open page: " + status);
    phantom.exit(1);
  }
});
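One caveat: the checkReadyState loop above polls forever if the page never reaches the complete state. Here is a sketch of the same polling pattern with a timeout guard; the helper name and parameters are my own, not part of the script:

```javascript
// Generic polling helper with a timeout guard (hypothetical, for illustration).
// Calls `check` every `interval` ms; invokes `onReady` once it returns true,
// or `onTimeout` after roughly `maxWait` ms have elapsed.
function waitFor(check, onReady, onTimeout, interval, maxWait) {
  var waited = 0;
  function tick() {
    if (check()) {
      onReady();
    } else if (waited >= maxWait) {
      onTimeout();
    } else {
      waited += interval;
      setTimeout(tick, interval);
    }
  }
  tick();
}

// Usage sketch: succeed once a counter reaches 3 checks.
var ticks = 0;
waitFor(
  function() { return ++ticks >= 3; },
  function() { console.log("ready after " + ticks + " checks"); },
  function() { console.log("timed out"); },
  50,
  5000
);
```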
The file main.js is the entry point. We pass in the URL and decide what to do with the received HTML; in this case we just print the page title to the console.
// main.js
const cheerio = require("cheerio");
const { fetch } = require("./scrapers/index.js");

const URL = "https://parabola.io/";

fetch(
  URL,
  error => {
    console.log(error);
  },
  html => {
    const $ = cheerio.load(html);
    const title = $("title").text();
    console.log(title);
  }
);
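cheerio gives you a jQuery-like API on the server, so extracting more than the title is just a matter of more selectors. For a quick one-off without the dependency, a regex pass over the same html string can do, too; a sketch (extractLinks is my own helper, and regex-based HTML parsing is fragile on real-world markup):

```javascript
// A dependency-free sketch: pulling every href out of the fetched HTML with a
// regex instead of cheerio. Fragile on real-world markup, fine for a quick look.
function extractLinks(html) {
  var links = [];
  var re = /<a\s[^>]*href="([^"]*)"/g;
  var match;
  while ((match = re.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

var sample = '<body><a href="/about">About</a> <a href="/pricing">Pricing</a></body>';
console.log(extractLinks(sample)); // prints [ '/about', '/pricing' ]
```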
Finally, all you need to do is start main.js.
node main.js
And it’s done! The application code is pretty short. I hope this gives you a rough idea of how to get started with web scraping using PhantomJS.