How SiteJuggler's Crawler Works

SiteJuggler's crawler operates in two modes: subscribed and non-subscribed. Subscribed mode is used when crawling a page of a site that has a license. Non-subscribed mode is used for pages discovered through links found during subscribed crawls.

Subscribed sites

This mode follows the robots.txt standard and checks for three types of user agents (a sketch of the lookup order follows the list):

  1. *: the wildcard user agent, used as the default check
  2. siteJuggler: the general agent for SiteJuggler
  3. siteJuggler[id]: this agent can be used to allow crawling only for a specific site with a license. There is currently no support for blocking specific pages, so access is all or nothing
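The examples below imply that the most specific matching agent takes precedence. A minimal sketch of that lookup order in Python, assuming robots.txt has already been parsed into a dict of rules per agent (the function and parameter names here are illustrative, not SiteJuggler's actual code):

def pick_rules(rules_by_agent, license_id):
    # Most specific agent wins: the licensed site's own agent first,
    # then the general SiteJuggler agent, then the wildcard default.
    for agent in (f"siteJuggler[{license_id}]", "siteJuggler", "*"):
        if agent in rules_by_agent:
            return rules_by_agent[agent]
    return []  # no matching group: nothing is disallowed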

Blocking SiteJuggler but allowing other crawlers

If you want to block SiteJuggler from crawling your website while still allowing all other crawlers, the following robots.txt can be used:

User-agent: *
Allow: /
User-agent: siteJuggler
Disallow: /

Blocking others while still using SiteJuggler yourself

If you don't want anyone else to use SiteJuggler on your website, but you still want to use SiteJuggler yourself, that option is also available. In the app, open your site settings and find the robots.txt snippet that allows your site. It will look like the following:

User-agent: siteJuggler
Disallow: /
User-agent: siteJuggler[723720c0-d6de-4220-9456-1422a2e20d60]
Allow: /

Non-subscribed sites

Currently there is no way for non-subscribed sites to opt out of crawling. However, links that are not part of a subscribed site are crawled at most once every thirty days, as sketched below.
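A minimal sketch of that thirty-day check, assuming a stored timestamp per link (the names here are illustrative, not SiteJuggler's actual code):

from datetime import datetime, timedelta, timezone

RECRAWL_INTERVAL = timedelta(days=30)

def is_due_for_crawl(last_crawled_at):
    # Non-subscribed links are crawled at most once every thirty days.
    if last_crawled_at is None:
        return True  # never crawled before
    return datetime.now(timezone.utc) - last_crawled_at >= RECRAWL_INTERVAL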

Links found on non-subscribed pages are also not followed, with the exception of redirects. Redirects are followed so the crawler can check that they actually resolve to a real page.
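As an illustration, here is a sketch of that behaviour using Python's requests library (an assumption; SiteJuggler's implementation language is not stated): redirects are resolved, but no links from the page are queued.

import requests

def crawl_non_subscribed(url):
    # Follow redirects so the final target can be verified, but do not
    # queue any links discovered on the page itself.
    response = requests.get(url, allow_redirects=True, timeout=10)
    return response.url, response.status_code  # final URL and its status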