How to Build a Web Scraper using JavaScript
Node.js, Async/Await and Headless Browsers
--
If you want to collect data from the web, you’ll come across a lot of resources teaching you how to do this using more established back-end tools like Python or PHP. But there’s a lot less guidance out there for the new kid on the block, Node.js.
Thanks to Node.js, JavaScript is a great language to use for a web scraper: not only is Node fast, but you’ll likely end up using a lot of the same methods you’re used to from querying the DOM with front-end JavaScript. Node.js has tools for querying both static and dynamic web pages, and it is well-integrated with lots of useful APIs, node modules and more.
In this article, I’ll walk through a powerful way to use JavaScript to build a web scraper. We’ll also explore one of the key concepts useful for writing robust data-fetching code: asynchronous code.
Asynchronous Code
Fetching data is often one of the first times beginners encounter asynchronous code. By default, JavaScript is synchronous, meaning that statements are executed one at a time, in order. Whenever a function is called, the program waits until the function returns before moving on to the next line of code.
But fetching data generally involves asynchronous code. Such code is removed from the regular stream of synchronous events, allowing the synchronous code to execute while the asynchronous code waits for something to occur: fetching data from a website, for example.
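To make the ordering concrete, here is a minimal sketch that uses a zero-delay timer as a stand-in for a slow operation such as a network request. The names order and runDemo are made up for this illustration.

```javascript
// Records the order in which each step actually runs.
const order = [];

function runDemo() {
  order.push('first (synchronous)');

  // setTimeout hands its callback to the event loop. The callback runs only
  // after the current synchronous code has finished, even with a 0 ms delay.
  setTimeout(() => {
    order.push('third (asynchronous callback)');
    console.log(order);
  }, 0);

  order.push('second (synchronous)');
}

runDemo();
```

Although the timer is set up before the second push, its callback is the last thing to run: the synchronous lines finish first, and the asynchronous work is picked up afterwards.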
Combining these two types of execution, synchronous and asynchronous, involves some syntax which can be confusing for beginners. We’ll be using the async and await keywords, introduced in ES2017. They’re syntactic sugar on top of ES6’s Promise syntax, which in turn is a more structured replacement for the previous system of callbacks.
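As a rough illustration of how the two styles relate, the sketch below writes the same operation twice: once with a Promise chain and once with async and await. The fakeFetch helper is hypothetical and simply simulates a network request with a timer. It is not a real API.

```javascript
// Hypothetical stand-in for a network request: resolves after a short delay.
function fakeFetch(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`contents of ${url}`), 10);
  });
}

// Promise style: chain a .then() callback onto the returned promise.
function getLengthWithThen(url) {
  return fakeFetch(url).then((body) => body.length);
}

// async/await style: the same behaviour, written top to bottom.
async function getLengthWithAwait(url) {
  const body = await fakeFetch(url);
  return body.length;
}

getLengthWithThen('https://example.com').then((n) => console.log('then:', n));
getLengthWithAwait('https://example.com').then((n) => console.log('await:', n));
```

Both functions return a Promise, so callers can treat them interchangeably. The await version simply reads more like ordinary synchronous code.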
Passed-in Callbacks
In the days of callbacks, we were reliant on placing every asynchronous function within another function, leading to what’s sometimes known as the ‘pyramid of doom’ or ‘callback hell’. The example below is on the simple side!
/* Passed-in Callbacks */
doSomething(function(result) {
doSomethingElse(result…