Chronicles of the Afro Kid

SEO in AngularJS apps with PhearJS

Whether you like it or not, single page applications (SPAs) are all the rage, and if you use AngularJS, SEO is likely to be one of your main pain points when building an application. Apart from Google, which makes sense since they built Angular, no other external service can render JavaScript applications properly. If you are using React, you might want to read this too, because unless you use server-side rendering, bots will not be able to render your site properly.

There are a few services out there which help with SEO for AngularJS apps - Brombone and Prerender.io. Both of them are paid services. With Prerender.io though, you can host your own version. I have used Prerender.io as a hosted service and tried hosting it on my own servers but it just was not cutting it for me.

After searching for a few hours, I landed on PhearJS, an open-source scraper. Under the hood, it uses the same technology: PhantomJS (a headless browser) renders dynamic pages, and the results are returned as either JSON or raw HTML. It uses memcached as a cache, so subsequent requests for the same pages are fast.

I will not go into the details of setting up a self-hosted instance in this post; their documentation does a great job of that. Instead, I will walk you through setting up your existing app to use PhearJS to serve your pages as static pages to search engines and the like.

The status page for my test PhearJS instance

The worker section in the status page

PhearJS comes with a status page where you can see some important statistics about the number of pages cached, along with separate information on each individual worker doing the legwork.

Using Express

If you are using ExpressJS to power your server, you can use the phearjs-express middleware to take care of bots visiting your site. Any bot tied to popular services will be forwarded to your hosted instance of PhearJS.

To set it up,
npm install phearjs-express --save

Then in your express middleware configuration file -

var prerender = require('phearjs-express');  
app.use(prerender({phear_url: 'http://myphearinstance.com'}));  

The phear_url is the fully resolved URL of your hosted PhearJS instance. I'm assuming you are running the service on port 80.
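
To give a rough idea of what "forwarding bots" involves, here is a minimal sketch of user-agent detection. It is not the source of phearjs-express: the helper name isBot is hypothetical, and the bot names are borrowed from the NginX configuration later in this post, so treat it as an illustration under those assumptions.

```javascript
// Hypothetical helper, NOT part of phearjs-express.
// The bot names are copied from the NginX configuration in this post.
const BOT_PATTERN = new RegExp(
  [
    'facebot', 'baiduspider', 'twitterbot', 'facebookexternalhit', 'rogerbot',
    'linkedinbot', 'embedly', 'quora link preview', 'showyoubot',
    'outbrain', 'pinterest', 'slackbot', 'vkShare', 'W3C_Validator',
  ].join('|'),
  'i' // case-insensitive, like NginX's ~* operator
);

// Returns true when the User-Agent header looks like one of the listed bots.
function isBot(userAgent) {
  // A missing User-Agent header is treated as a regular visitor.
  return typeof userAgent === 'string' && BOT_PATTERN.test(userAgent);
}

console.log(isBot('Twitterbot/1.0'));                             // true
console.log(isBot('Mozilla/5.0 (Windows NT 10.0) Chrome/120.0')); // false
```

Googlebot is deliberately absent from the list, since Google can render the application on its own.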

Using NginX

If you use NginX as a reverse proxy for your Node application, I would recommend going down this route because, as bots would have it, they pound servers with requests. NginX is better at handling simultaneous requests hitting the server. If you can't or don't scale your Node app, the sheer number of requests will take down your server.[1]

To configure your NginX server to forward requests to your PhearJS instance, open up the configuration file.
vi /etc/nginx/sites-available/mysite.com

In your server block [2],

location / {  
    try_files $uri @prerender;
}

location @prerender {  
     set $prerender 0;

     if ($http_user_agent ~* "facebot|baiduspider|twitterbot|facebookexternalhit|rogerbot|linkedinbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator") {
        set $prerender 1;
     }

     if ($args ~ "_escaped_fragment_") {
        set $prerender 0;
     }

     if ($http_user_agent ~ "PhearBot/0.4.1 - http://phear.io") {
        set $prerender 0;
     }

     if ($uri ~ "\.(js|css|xml|less|png|jpg|jpeg|gif|pdf|doc|txt|ico|rss|zip|mp3|rar|exe|wmv|avi|ppt|mpg|mpeg|tif|wav|mov|psd|ai|xls|mp4|m4a|swf|dat|dmg|iso|flv|m4v|torrent|ttf|woff)") {
        set $prerender 0;
     }

     #resolve using Google's DNS server to force DNS resolution and prevent caching of IPs
     resolver 8.8.8.8;

     # The main block
     if ($prerender = 1) {
        set $prerender "myphearinstance.com";
        rewrite .* /?fetch_url=$scheme://$host$request_uri&raw=true break;
        proxy_pass http://$prerender;
     }

     if ($prerender = 0) {
        set $proxy_host $host;
     }

     # Proxy headers go here

     if ($prerender = 0) {
        proxy_pass http://localhost:3000;
     }
}

The configuration above is, for the most part, straightforward. We declared a named location block to handle routing. The part worth going over is the one under the comment "The main block". If the requesting agent is a bot, other than Google's, we send it to our hosted instance of PhearJS. Before we send it over, though, we pass the requested URL as the value of the fetch_url parameter and set raw to true, because we want to serve the raw HTML to the bot. By default, PhearJS serves the JSON form of the page.
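
To make the rewrite step concrete, here is a minimal JavaScript sketch of how the request to PhearJS is put together. The function name buildPhearUrl and the example.com page are assumptions made for illustration; only the fetch_url and raw parameters come from the configuration above. Unlike the NginX rewrite, this sketch percent-encodes the page URL, so that a query string on the original page is not mistaken for extra PhearJS parameters.

```javascript
// Hypothetical helper for illustration; only the fetch_url and raw
// parameter names are taken from the NginX configuration in this post.
function buildPhearUrl(phearBase, pageUrl) {
  const url = new URL(phearBase);
  // The page the bot asked for, percent-encoded by URLSearchParams.
  url.searchParams.set('fetch_url', pageUrl);
  // Ask for raw HTML instead of the default JSON response.
  url.searchParams.set('raw', 'true');
  return url.toString();
}

console.log(buildPhearUrl('http://myphearinstance.com', 'http://example.com/products?page=2'));
// http://myphearinstance.com/?fetch_url=http%3A%2F%2Fexample.com%2Fproducts%3Fpage%3D2&raw=true
```

Requesting the resulting URL from your own instance is a quick way to check what a bot would receive for a given page.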


[1]: This situation is a real-life example. Sometimes you are bound by the number of servers you have, and you need to keep your app from crashing when bots decide to say hi. Letting NginX handle the traffic from the get-go lets me keep things running without raising my blood pressure more than it already is.
[2]: Thank you @thoop over at Prerender.io for the original NginX configuration script.