Scrape Websites With NodeJS

In the RC airplane hobby we’ve been both blessed and cursed by the rise of inexpensive foreign manufacturing. On the plus side, the cost of participating in the hobby has dropped drastically, because materials and components are far cheaper than they used to be. On the negative side, the hobby has seen a race to the bottom on price, which has led to dramatic cost cutting. When kits and components are produced in such large quantities, proper QA is often sacrificed in favor of a lower cost to the distributor.

This is also seen on many manufacturers’ and distributors’ websites. Often these sites are thrown together from templates by people who are not that technically gifted. Updates, both to the software hosting the site and to the content on it, can be very hit or miss. Shady payment methods can sometimes be spotted and avoided, and intermediary payment services like PayPal make doing business with such places a little less of a headache.

To me, the largest problem is figuring out what is in stock! All the above could be forgiven to some extent if, when you ordered an item, you knew it was in the warehouse, would ship, and would soon be in transit to you. Browsing the many planes an overseas manufacturer or importer claims to carry gets frustrating when you finally find what you’ve been searching for, only to learn that its stock is zero.

I ran into this problem with a site called GeneralHobby.com. They import many quality models from overseas. These models are cheap, but I’ve seen them in person and can attest that some of them have great quality for the asking price. Often I would go to the site to browse for something that would catch my interest. Inevitably I’d see something very cool, or what looked like a good deal. Clicking on the picture of the item would only lead to a description with an out-of-stock notice. I knew there had to be a better way!

Enter JavaScript

Looking at the site I could see that it had a very logical layout. With these sites you can be almost 100% positive that they’re generated from some sort of template driven by a server-side dataset. This being the case, I knew it was possible to parse the page and get the information I wanted in a format that is easier to consume. My bet was that I could use jQuery and CSS selectors to get a list of items on the page and use it to generate a JSON object with information about the site’s stock.

In order to do this I decided to use NodeJS to act as a client and fetch the pages from the server. After fetching each page I would parse the content for data and then look for the “next page”.
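The fetch, parse, next-page cycle can be sketched without any particular library. The version below is only an illustration, not the project's code: `scrapeAllPages`, `fetchPage`, and the fake fetcher are hypothetical names, and the fake fetcher stands in for a real HTTP request.

```javascript
// Hypothetical sketch of the fetch -> parse -> next page loop.
// fetchPage(pageNumber, callback) is an assumed interface: it calls back with
// (error, { items: [...], totalCount: n }) for the requested page.
function scrapeAllPages(fetchPage, onDone) {
  var collected = [];

  function processPage(pageNumber) {
    fetchPage(pageNumber, function (error, page) {
      if (error) {
        return onDone(error, collected);
      }
      collected = collected.concat(page.items);
      // Keep going until we have seen as many items as the site reports.
      // Stop on an empty page too, so a wrong total cannot loop forever.
      if (page.items.length > 0 && collected.length < page.totalCount) {
        processPage(pageNumber + 1);
      } else {
        onDone(null, collected);
      }
    });
  }

  processPage(1);
}

// A fake fetcher serving 5 made-up items, 2 per page, in place of a real request.
var fakeItems = ['A', 'B', 'C', 'D', 'E'];
function fakeFetchPage(pageNumber, callback) {
  var start = (pageNumber - 1) * 2;
  callback(null, {
    items: fakeItems.slice(start, start + 2),
    totalCount: fakeItems.length
  });
}

var scraped = null;
scrapeAllPages(fakeFetchPage, function (error, items) {
  if (error) { throw error; }
  scraped = items;
});
console.log(scraped);
```

Passing the fetcher in as a parameter keeps the paging logic testable without touching the network.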

GeneralHobby Stock Checker

What I came up with is the GeneralHobby Stock Checker. This NodeJS application does all of the above plus it builds a web page with an AngularJS control to display the JSON dataset.

The repository is split into two sections. In the root is index.js, the NodeJS portion, which is run to generate a JSON data file. The app folder holds the web page and its supporting files.

NodeJS

The NodeJS portion is responsible for the heavy lifting: getting the data and dumping it to a JSON data file. I used a couple of common modules to do this, jsdom and jquery. jsdom makes the request and hands back a reference to the window object. With that window object I could then create a jQuery object to parse the page with CSS selectors.

var $Config = {
	url: $BaseURL,
	scripts: null,
	done: function(errors, window) {
		if (errors) {
			return error(errors); // bail out rather than parse a failed response
		}
			
		var $ = jquery(window);
		
		var countText = $('DIV#main.col-main FORM TABLE.pager TBODY TD STRONG').text();
		var totalCount = parseInt(countText.substring(0, countText.length - 8), 10);
		
		var listingItems = $('DIV#main.col-main div.listing-item');
		
		log('Processing ' + listingItems.length + ' items');
		
		listingItems.each(function() {
			var itemText = $('DIV.product-shop H5 A:last-child', this).text();
			var pageUrl = $('DIV.product-image A', this).attr('href');
			var imageUrl = $('DIV.product-image A IMG', this).attr('src');
			
			var item = {
				name: itemText,
				soldOut: true,
				imageUrl: 'http://www.generalhobby.com/' + imageUrl,
				pageUrl: pageUrl,
				regularPrice: null,
				specialPrice: null
			};
			
			var regularPriceElement = $('.price-box .regular-price .price', this);
			var specialPriceElement = $('S', regularPriceElement);
			if (specialPriceElement.length) {
				item.regularPrice = specialPriceElement.text();
				item.specialPrice = $('.productSpecialPrice', regularPriceElement).text();
			}
			else {
				item.regularPrice = regularPriceElement.text();
			}			
			
			$JSONData.push(item);
			
			if ($('DIV.product-shop SPAN:contains("Sold Out")', this).length) {
				return;
			}
			else {
				item.soldOut = false;
				$TotalAvailable++;
				log(itemText);
			}
		});
		
		$ProcessedCount += listingItems.length;
		
		bigInfo('Completed processing of ' + $ProcessedCount + ' items!');
		
		if ($ProcessedCount < totalCount) {
			var page = $BaseURL + '?page=' + (++$ProcessedPage).toString() + '&sort=5a';
			log(page);
			$Config.url = page;
			
			jsdom.env($Config);
		}
		else {
			dumpJSONInfo();
			bigInfo('Processed ' + $TotalAvailable + ' available items!');
		}
	}
};

The line numbers mentioned below count from the top of the listing above, starting at 1.

Line 9 shows the coolness of this whole process. This configuration object is passed to the jsdom call, which calls this done method when its request finishes. The window object is then used to create the classic $ jQuery object.

Next we need to pull the various pieces of information from the page. Line 11 shows how the total item count is retrieved; this tells us when we’ve processed all the pages and can dump out the JSON. Line 14 finds all the items on the current page. The jQuery collection it returns is iterated with the each method to extract the information for each item. The data extraction is shown on lines 19-21. Lines 32 and 36 show how the price information is obtained. Some items have a sale price. The standard price is always shown, so we look for a special price element and, if it exists, read its text. For this application the most important line is line 44, which checks for the out-of-stock notice! Finally, lines 59-63 build the URL for the next page. The $Config object is "global", so its url field is reassigned to the new URL and jsdom is called again. This recursive calling ends when there are no more pages to scrape.
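The count parsing and next-page URL steps can be pulled out into small helpers. This is a sketch under assumptions, not the project's code: `parseTotalCount` and `buildPageUrl` are hypothetical names, the sample pager text and base URL are made up, and the regex approach assumes the count is the first number in the pager text. It swaps the fixed-length substring trim for a digit match, so a change in the trailing wording would not break the parse.

```javascript
// Pull the first run of digits out of the pager text instead of trimming a
// fixed number of trailing characters. Thousands separators are dropped first.
// Returns NaN when no number is present.
function parseTotalCount(countText) {
  var match = /\d+/.exec(countText.replace(/,/g, ''));
  return match ? parseInt(match[0], 10) : NaN;
}

// Build the URL for a given page, mirroring the '?page=N&sort=5a' pattern
// used in the listing above.
function buildPageUrl(baseUrl, pageNumber) {
  return baseUrl + '?page=' + pageNumber + '&sort=5a';
}

// Made-up inputs for illustration only.
console.log(parseTotalCount('245 Item(s)'));
console.log(buildPageUrl('http://example.com/planes', 3));
```

Checking the parsed count with `isNaN` before comparing it to the processed count would keep a layout change on the site from silently ending the scrape after one page.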

Dumping the JSON is fairly trivial.

var dumpJSONInfo = function() {
	var jsonData = {
		date: new Date(),
		products: $JSONData
	};
	
	fs.writeFile('app/scripts/data.json', 'var $DATA = ' + JSON.stringify(jsonData, null, 4) + ';', function(e) {
		if (e) {
			error(e);
		}
		else {
			log('Data saved!');
		}
	});
};

The file is placed into the web directory’s scripts folder. It is then loaded by the page and passed into the AngularJS repeater.
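Note that the file holds a JavaScript assignment rather than bare JSON, which is what lets a plain script tag load it. The sketch below shows that round trip using Node's built-in vm module as a stand-in for the browser; `buildDataScript` and the sample product are hypothetical.

```javascript
var vm = require('vm');

// Wrap the payload in an assignment so a plain <script> tag can load it.
function buildDataScript(products, date) {
  var payload = { date: date, products: products };
  return 'var $DATA = ' + JSON.stringify(payload, null, 4) + ';';
}

var script = buildDataScript(
  [{ name: 'Example Plane', soldOut: false }], // made-up product
  new Date(Date.UTC(2014, 0, 15))
);

// Evaluate the script in an isolated context, roughly as a browser would
// when it loads the file: the top-level var becomes a global.
var sandbox = {};
vm.runInNewContext(script, sandbox);

// JSON.stringify serializes the Date as an ISO 8601 string, so the page
// receives text rather than a Date object.
console.log(typeof sandbox.$DATA.date);
console.log(sandbox.$DATA.products[0].name);
```

Because the date arrives as a string, any formatting on the page has to work from ISO 8601 text.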

Web Page

The web page that displays the JSON data is very simple as well.

<div ng-cloak ng-show="data.products">
  <div class="panel panel-default">
    <div class="panel-heading">Filter</div>
    <div class="panel-body">
      <form class="form-inline">
        <div class="form-group">
          <input ng-model="productNameFilter" ng-change="filterChanged" type="text" class="form-control" placeholder="Product name" />
        </div>
        <div class="form-group" style="float: right;">
          <label><input type="radio" class="form-control" ng-model="sortReverse" ng-value="false" /> Ascending</label>
          <label><input type="radio" class="form-control" ng-model="sortReverse" ng-value="true" /> Descending</label>
        </div>
      </form>
    </div>
  </div>
		  
  <div>
    <b>Total:</b> {{filtered.length}} found on {{data.date | date:'medium'}}
  </div>
		  
  <table class="table table-striped">
    <thead>
      <tr>
        <th>&nbsp;</th>
        <th>Product</th>
        <th>Price</th>
      </tr>
    </thead>
    <tbody>
      <tr ng-repeat="product in filtered = (data.products | filter:filterList) | orderBy:orderProduct:sortReverse">
        <td>
          <a ng-href="{{product.pageUrl}}" target="_blank"><img class="productImage" ng-src="{{product.imageUrl}}" /></a>
        </td>
        <td>
          <a ng-href="{{product.pageUrl}}" target="_blank">{{product.name}}</a>
        </td>
        <td>
          <s ng-show="product.specialPrice">{{product.regularPrice}}</s>
          <span ng-show="product.specialPrice">{{product.specialPrice}}</span>
          <span ng-hide="product.specialPrice">{{product.regularPrice}}</span>
        </td>
      </tr>
    </tbody>
  </table>
		  
  <div>
    <b>Total:</b> {{filtered.length}} found on {{data.date | date:'medium'}}
  </div>
</div>

What is occurring here is essentially an ng-repeat with filter and ordering options.
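For readers who don't use AngularJS, the filter-then-order pipeline is roughly equivalent to the plain JavaScript below. The controller's filterList and orderProduct are not shown in this post, so this sketch assumes a case-insensitive match on the product name and ordering by name; `filterAndSort` and the sample products are made up.

```javascript
// Plain-JavaScript approximation of "filter, then orderBy".
// Assumes filtering and ordering both work on the product name.
function filterAndSort(products, nameFilter, sortReverse) {
  var needle = (nameFilter || '').toLowerCase();
  // filter() returns a new array, so the input list is left untouched.
  var filtered = products.filter(function (product) {
    return product.name.toLowerCase().indexOf(needle) !== -1;
  });
  filtered.sort(function (a, b) {
    return a.name.localeCompare(b.name);
  });
  return sortReverse ? filtered.reverse() : filtered;
}

// Made-up products for illustration only.
var sample = [
  { name: 'Sky Trainer' },
  { name: 'Aero Glider' },
  { name: 'Sky Racer' }
];

var names = filterAndSort(sample, 'sky', false).map(function (p) {
  return p.name;
});
console.log(names);
```

As in the template, the count shown to the user would come from the filtered list, since filtering happens before ordering.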

Summary

So to use this, all anyone has to do is clone the repo, npm install the dependencies, run node index.js, and then fire up a browser pointed at index.html in the app directory.

SIMPLE! Now I can figure out what is in stock!

About Mike

I'm a software engineer. Look into the about page for more information about me.