How to serve AJAX pages (Ember.js, Angular, etc.) to Google's dummy bots?

鲜于谦
2023-12-01

There may (or must) be better ways, but here we go.

Recipes:

  1. Headless browser component, e.g.
    1. PhantomJS [1] is the preferred option because it's 1) V8-based and 2) lightweight, compared to the next choice.
    2. Firefox + Xvfb. I had to use this one because my site breaks under PhantomJS (even though it works fine under Chrome).
  2. Selenium to drive the browser and generate the HTML.
  3. Web server that serves the bots (sketched after the next paragraph).

As defined by Google [2], AJAX apps should use #! to indicate to the bots that the page is an AJAX page; the bots will then request a ?_escaped_fragment_= URL for that AJAX address and expect a JavaScript-free page. So there must be something that runs the JavaScript and generates a proper DOM for the dummy bots. This is where headless browsers come in.
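
To make recipe item 3 concrete, here is a minimal sketch of the bot-facing server, assuming Flask (my pick for illustration; the code at [3] may well be structured differently). It maps a request carrying ?_escaped_fragment_= back to the original #! URL and hands it to a rendering helper; SITE_ROOT and render_with_browser are hypothetical names, with the rendering part filled in further below.

    # Minimal sketch of the bot-facing web server, assuming Flask.
    # SITE_ROOT and render_with_browser are placeholder names for illustration.
    from flask import Flask, request

    app = Flask(__name__)

    SITE_ROOT = "http://example.com"  # hypothetical origin of the real AJAX app


    def render_with_browser(url):
        # Stub; replaced by the Splinter-based version sketched below.
        raise NotImplementedError


    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def serve_bot(path):
        fragment = request.args.get("_escaped_fragment_")
        if fragment is None:
            # Not a crawler request; a real setup would proxy or redirect here.
            return "Not an _escaped_fragment_ request", 400
        # Rebuild the #! URL the crawler is asking about:
        # /foo?_escaped_fragment_=/bar  ->  http://example.com/foo#!/bar
        ajax_url = "%s/%s#!%s" % (SITE_ROOT, path, fragment)
        return render_with_browser(ajax_url)


    if __name__ == "__main__":
        app.run(port=8000)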

Xvfb is a special X server that runs (at least for me) on Linux and requires no interaction with graphics devices. It renders everything in memory, so it can easily run on headless servers such as Amazon EC2 Linux instances. Firefox is the de facto browser on Linux, works well with Xvfb, and is the default driver for Selenium, so it's the natural choice.
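
As a sketch of how the Xvfb + Firefox pair can be driven from Python: the pyvirtualdisplay package (my assumption here, a thin wrapper around Xvfb) starts the virtual display before Selenium launches Firefox. Exporting DISPLAY towards a manually started Xvfb works just as well.

    # Sketch: run Firefox against Xvfb from Python.
    # pyvirtualdisplay is an assumption on my part; it simply wraps Xvfb.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(1024, 768))  # spins up Xvfb in the background
    display.start()

    browser = webdriver.Firefox()   # renders into the virtual display, no GPU needed
    browser.get("http://example.com/#!/some/page")  # hypothetical #! URL
    html = browser.page_source      # the fully rendered DOM

    browser.quit()
    display.stop()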

Selenium was designed for browser-based test automation. It can drive different browsers, starting with Firefox (built-in support), plus Chrome and IE (both require an extra "driver"). In Python there's an official Selenium API, but there are also easier wrappers such as Splinter, which is my choice.
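
Roughly, the rendering step with Splinter driving Firefox looks like the sketch below. The fixed sleep is a crude stand-in for waiting on the AJAX framework to finish, and settle_seconds is a made-up parameter; a real setup would wait on a specific element instead.

    # Sketch of render_with_browser() using Splinter on top of Firefox.
    import time

    from splinter import Browser


    def render_with_browser(url, settle_seconds=2):
        """Visit the #! URL, let the JavaScript build the DOM, return the HTML."""
        with Browser("firefox") as browser:
            browser.visit(url)
            time.sleep(settle_seconds)  # crude wait for the AJAX app to render
            return browser.html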

If we simply forward every URL to Firefox, we load a page 20-100x slower than actually loading it in Firefox, because for each resource (CSS, JavaScript, images) the server starts a new Firefox tab (if not window) to retrieve it, while the first AJAX page load would have fetched them all already. This is slow, so a hack is used here: load all static resources via Requests instead. Better optimisations are available, though.
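
In code, that hack boils down to a routing decision like the one below: anything that looks like a static asset is fetched directly with Requests, and only the pages themselves go through Firefox. The extension list and SITE_ROOT are assumptions for illustration, carried over from the earlier sketches.

    # Sketch of the static-resource shortcut: fetch assets with Requests,
    # reserve the (slow) Firefox round trip for actual AJAX pages.
    import requests

    STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico", ".woff")
    SITE_ROOT = "http://example.com"  # hypothetical origin, as above


    def fetch_static(path):
        """Return (body, status, content type) for a static asset, or None when
        the path is an AJAX page and should go through render_with_browser()."""
        if path.lower().endswith(STATIC_EXTENSIONS):
            resp = requests.get("%s/%s" % (SITE_ROOT, path))
            return resp.content, resp.status_code, resp.headers.get("Content-Type", "")
        return None  # not a static asset: render it with Firefox instead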

So everything together is at [3].

Good luck.

[1] PhantomJS: http://phantomjs.org
