How to Build a Web Scraper using JavaScript

Node.js, Async/Await and Headless Browsers

Bret Cameron

A different kind of scraper… (Image Credit: Jannes Glas / Unsplash)

If you want to collect data from the web, you’ll come across a lot of resources teaching you how to do this using more established back-end tools like Python or PHP. But there’s a lot less guidance out there for the new kid on the block, Node.js.

Thanks to Node.js, JavaScript is a great language to use for a web scraper: not only is Node fast, but you’ll likely end up using a lot of the same methods you’re used to from querying the DOM with front-end JavaScript. Node.js has tools for querying both static and dynamic web pages, and it is well-integrated with lots of useful APIs, node modules and more.

In this article, I’ll walk through a powerful way to use JavaScript to build a web scraper. We’ll also explore one of the key concepts useful for writing robust data-fetching code: asynchronous code.

Asynchronous Code

Fetching data is often one of the first times beginners encounter asynchronous code. By default, JavaScript is synchronous, meaning that statements are executed line by line. Whenever a function is called, the program waits until that function returns before moving on to the next line of code.

But fetching data generally involves asynchronous code. Such code is taken out of the regular stream of synchronous execution, so the rest of the program can keep running while the asynchronous code waits for something to happen: a response arriving from a website, for example.
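To make the difference concrete, here’s a minimal sketch of synchronous and asynchronous code running side by side. The function name runDemo and its onFinished parameter are my own, chosen for illustration; setTimeout stands in for any operation that finishes later, such as a network request.

```javascript
// Hypothetical demo function: records the order in which each step runs.
function runDemo(onFinished) {
  const order = [];

  order.push('synchronous start');

  // setTimeout hands its callback to the runtime to run later. Even with a
  // 0 ms delay, the callback only runs after the current synchronous code
  // has finished.
  setTimeout(() => {
    order.push('asynchronous callback');
    onFinished(order);
  }, 0);

  order.push('synchronous end');
}

runDemo((order) => console.log(order.join('\n')));
// prints:
// synchronous start
// synchronous end
// asynchronous callback
```

Notice that 'synchronous end' is recorded before the callback runs, even though the callback appears earlier in the source.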

Combining these two types of execution, synchronous and asynchronous, involves some syntax which can be confusing for beginners. We’ll be using the async and await keywords, introduced in ES2017 (sometimes called ES8). They’re syntactic sugar on top of ES6’s Promises, which are, in turn, a tidier abstraction over the previous system of callbacks.
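To give a flavour of where we’re heading, here’s a rough sketch of the same logic written once with a Promise chain and once with async and await. The fakeFetch function is a hypothetical stand-in I’ve made up so the example runs without a network connection; the other function names are also just for illustration.

```javascript
// Hypothetical stand-in for a network request: resolves with some text
// after a short delay. A real scraper would fetch a page here instead.
function fakeFetch(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`contents of ${url}`), 10);
  });
}

// Promise style: chain .then() to handle the eventual result.
function getLengthWithThen(url) {
  return fakeFetch(url).then((text) => text.length);
}

// async/await style: the same logic, written to read like synchronous code.
async function getLengthWithAwait(url) {
  const text = await fakeFetch(url);
  return text.length;
}

// Both functions return a Promise that resolves to the same number.
getLengthWithThen('https://example.com').then((n) => console.log('then:', n));
getLengthWithAwait('https://example.com').then((n) => console.log('await:', n));
```

An async function always returns a Promise, which is why both versions can be consumed in exactly the same way.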

Passed-in Callbacks

In the days of callbacks, each asynchronous step had to be nested inside the callback of the step before it, leading to what’s sometimes known as the ‘pyramid of doom’ or ‘callback hell’. The example below is on the simple side!

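/* Passed-in Callbacks */
doSomething(function(result) {
  doSomethingElse(result, function(newResult) {
    console.log(newResult); // illustrative ending; each extra step would nest one level deeper
  });
});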
