Tag Archives: reddot

Canonical URLs and SEO

As I recently made a foolish mistake, I thought I would share it to help others avoid it in the future.  It was to do with my quest to get certain pages of the Solution Exchange Community platform indexed in Google, Bing, and Yahoo etc.  Specifically, the valuable forum threads.

First of all, it is worth mentioning how these threads are delivered.  The forum itself is an object of the OpenText Social Communities (OTSC) product, which interacts with the Delivery Server through the OTSC XML API.

Therefore, the forum thread pages are dynamically delivered with the shell of the page being the same physical page with the content influenced by parameters.  In this case, I’ve chosen to utilise sensible URL structures that contain the parameters for simplification and SEO.  I mention more about this in this forum post.  The use of rewrite rules in this way for SEO is one of the key values of a Front Controlling Web Server.

As the shell of the page is the same, I initially had the same <title> tag for all threads and thought that this was the problem.  After changing to adapt the <title> value to the title of the forum thread (along with waiting for re-indexing to happen) there was no change.

Finally, through checking the index of Solution Exchange on Bing with a “site:” search, I noticed to my surprise that one of the threads was indexed but was associated with the URL http://www.solutionexchange.info/forum.htm!!!  This was strange due to the fact that externally, the forum thread was only accessible through a URL like http://www.solutionexchange.info/forum/thread/{ID} meaning that I must be explicitly telling the search engines the wrong URL.  

This was the clue I needed to realise that my problem was due to something I had implemented many months before.

To address the potential SEO penalty that the home page of the community was able to be reached through http://www.solutionexchange.info/ and http://www.solutionexchange.info/index.htm, I introduced the use of the following html header link tag – the example below is the home page value but I included this across the whole site:

<link rel="canonical" href="http://www.solutionexchange.info/index.htm" />

You can read more about this on the Official Google Webmaster Central Blog.  In summary, it tells the search engines that this page is to be associated with the given URL and page ranking (or “Google juice”) is to be associated with that and not the entry URL that the crawler bot used.  This avoids the possibility of page ranking for the same page being split across two or more URLs or being penalised for duplicating content across multiple URLs.

With this knowledge, I was able to update the page template that houses this dynamic content to form the correct URL within this canonical link.  Now it’s back to the waiting game to see if the indexes will pick the content and forgive me for positioning different pages as one.

Although a small detail, the end goal and potential gain is huge as it opens up the rich content that continues to grow within the forum for discovery via the big search engines.  This in turn will only help those within the wider community who are not aware of Solution Exchange discover the content, which may help them resolve an issue or encourage them to take part in the community platform moving forwards.

As always, leave a comment or get in touch if you have any questions.

IIS7, Tomcat & Application Request Routing

Further Update: 27nd June 2011

Another update on this topic. If you were making the use of custom error pages in IIS7 and you implemented the below update, you may have noticed that the custom error commands are no longer being adhered to. To change this, you need to set up custom error pages at a site level by choosing your site, selecting “Error Pages”, then “Edit Feature Settings” from action menu and then “Custom error pages”.

Important Update: 22nd June 2011

On page 2 of this article (How To Configure IIS 7.0 and Tomcat with the IIS ARR Module), there is a key step that I failed to observe when I wrote the original post below.  The step in question is the enablement of the (reverse) proxy server after the ARR install.  By doing this, you are able to apply rewrite rules at the site level — something I wasn’t able to achieve originally, which meant that the routing rules within my server farm were somewhat overloaded.

With this setting enabled, I can leave a single delegation rewrite rule at the server farm level, telling IIS to delegate HTTP requests of a certain pattern but leave the rewrite rules that are there for beautification at the desired site level.  This is a much tidier and more scalable approach.

One gotcha that you need to be aware of is that the rewrites at the site level need to be absolute URLs.  Therefore, you could be tempted to place the host of a single tomcat instance that lay behind IIS direct in here and it would work fine but why not allow for a little future proofing and use localhost within all absolute URL site level rewrites, which isolates the rewrites used for masking ugly application URLs and delegates the job of request delegation to the server farm?  This approach would allow for the server farm config to be used to bring other tomcat instances online or taken offline for maintenance etc without having to change the site level configuration.  In other words, it keeps the various areas of the IIS7 interface focused on the job in hand allowing for easier administration.

Please keep this update in mind as you read the otherwise unchanged original post below.

Regards,

Dan

After many years of using the Tomcat Connector (http://tomcat.apache.org/connectors-doc/) when setting up Tomcat behind IIS, it is now time to say goodbye.

This is the conclusion that I’ve come to after having some particularly significant challenges using IIS7 on a 64bit Windows 2008 machine.

The traditional approach I’ve used in the past has been to utilise the Tomcat Connector, which is implemented as an ISAPI Filter, to delegate requests from IIS through to Tomcat.  This has worked great for me in the past and was the subject of a previous article (http://bit.ly/lp6zW) but the 64bit system threw in a couple of additional challenges that weren’t so easy to get around.

The problems faced led me to discover Application Request Routing (ARR), an official extension for IIS7, which allows you to define the delegation of requests to servers sitting behind the IIS instance.

What is particularly nice with this extension is the way in which it facilitates the former approach within the GUI, making it easier to understand what is being delegated.  The approach however, is similar to the ISAPI filter approach – delegating based on URL path patterns.

The following takes you through an overview of how to set this up:

1. Install ARR

You can obtain the appropriate install for the ARR IIS7 extension at http://www.iis.net/download/applicationrequestrouting

Once installed, the ‘Server Farms’ node indicates that it has installed correctly as indicated in the picture below.

ARR Install

The Server Farms node is seen if ARR is installed correctly

A number of  modules are added as part of this extension.  You can find the details of these from the same ARR link (http://www.iis.net/download/applicationrequestrouting)

2. Create Server Farm

Although the concept of a ‘farm’ of servers may be overkill for our needs of delegating HTTP requests through ISS7 to Tomcat, we shall never the less set up a farm containing one server – our Tomcat instance.

To do this:

  1. highlight the ‘Server Farms’ node in the left panel of the IIS7 Management Console .
  2. Choose ‘Create Server Farm’ from the right hand side action menu.
  3. You will be prompted for a name for the farm.  For my  needs in setting up the Open Text Delivery Server behind IIS7, I gave the farm the name ‘Tomcat – Delivery Server’.ARR Server Farm Name
  4. You will then be prompted to set up a server in the farm.  In our case, we are just going to select the localhost instance of Tomcat running on port 8080. To specify the port, open the ‘Advanced settings’.  Strangely, there appears to be no easy way to edit a servers port once set up so make sure you are correct, otherwise you will have to delete and add a new server.

    ARR Add Server

    Make sure you open the Advanced settings to edit the port number

3. Configure the Routing Rules

Now that we have informed IIS7 about the server that sits behind, we need to let it know how we wish to delegate HTTP requests to it.  To do this, we choose the newly created Server Farm in the left hand panel and select the Routing Rules feature.ARR Routing RulesWithin here, we have a few options.  I’ve chosen to keep the defaults of having both checkboxes checked and have no exclusions set as I am delegating this responsibility to the URL Rewrite Rules.

From here, you can add and modify the rewrite rules defining how requests are delegated using the ‘URL Rewite’ link in the right-hand action panel.

In my case, I chose to change the default rule that was set up for me to a regular expression as opposed to the wildcard default.  However, I only chose this due to personal preference.  The pattern I used for this rule is:

cps(.+)

and I ignore the case.

Finally, I have no Conditions or Server Variables to take note of in my scenario although they can easily be added here, so I conclude the rule by setting the action to ‘Route to Server Farm’ and chose my ‘Tomcat – Delivery Server’ farm with a path setting of

/{R:0}

This passes all URL path info through to Tomcat.  I also choose to stop processing of subsequent rules

4. Refine Rules for your Environment

Lastly, in my setup, I’ve added the following further rules to refine how my site is served through IIS7:

Delegate .htm and .html requests:

Pattern - ([^/]+.html?)
Action path - /cps/rde/xchg/<project>/default.xsl/{R:1}

Delegate .xml requests:

Pattern - ([^/]+.xml?)
Action path - /cps/rde/xchg/<project>/default.xsl/{R:1}

Delegate default home page

Pattern - ^/?$
Action path - /cps/rde/xchg/<project>/default.xsl/index.htm

Summary

Although this approach of using IIS7 in a reverse proxy capacity may not benefit from the efficiencies of the AJP protocol used by the Tomcat Connector, the impact in most sites will be negligible.  In exchange, you have a way of Tomcat and IIS7 working together in a way where the GUI of the IIS7 Management Console helps admins define and understand what is happening.  The ISAPI Filter approach is often not so visible because of the broad nature of what ISAPI modules can provide but also due to the configuration required outside of the IIS7 Management Console.

As always, if you have any questions, leave a comment.

Open Text Delivery Server with a Front Controlling Web Server

Overview

This post discusses the best practice of deploying the Open Text Delivery Server in an optimal way alongside a front controlling web server.

Delivery Server is a dynamic web server component that has strengths in coarse grained personalisation and dynamic behaviour as well as system integration.  Therefore, as it is housed within a Servlet Container, it is not the ideal location from which to serve static content (unless you wish to maintain a level of access control over the static content).

Leveraging the use of a front controlling Web Server, facilitates an optimised site deployment as web servers such as Microsoft’s IIS or Apache’ HTTP Server can be utilised for delivering static content in an optimised way.  For example, it is possible to easily configure a far future ‘Expires’ header on a given folder and therefore its content within either Apache or IIS, which promotes the caching of content in a user’s browser, which reduces page load times.  Another example is in the use of mature compression features within such web servers.  Although these examples can be achieved with some Servlet Container’s, it is certainly not straight forward and doesn’t necessarily make sense from an architectural perspective.

It is for this architectural reason, that best-practice dictates that we delegate only the relevant HTTP requests to Delivery Server.  In most cases, this means that Delivery Server is delegated requests for .htm and .xml resources.  The rest can be served from the front controlling web server (or better still a CDN).

This article provides a high-level overview of what to set up.  Depending on feedback, I may post further posts on the details of each step.

Delegating Requests from the Web Server to Delivery Server

This step can be easily achieved using the Tomcat Connector for both IIS and Apache. To find out more see the Tomcat Connector documentation here: http://bit.ly/at1w8G.

This connector uses the Apache JServ Protocol, which connects to port 8009 by default on Tomcat and is optimised to use a single connection between the Web Server and the Delivery Server for many HTTP requests.  Therefore, this represents a better option than using reverse proxy functionality within the Web Server.

If we take a typical Delivery Server install (i.e. the reference install using Tomcat), a page can be accessed with something like the following URL:

http://<host>:8080/cps/rde/xchg/<project>/<xsl_stylesheet>/<resource>

where resource could be any text based file like index.html or action.xml.

The result of correctly installing the Tomcat Connector means that we can access that same resource but through the Web Server on port 80 and not direct to the Tomcat instance on port 8080:

http://<host>/cps/rde/xchg/<project>/<xsl_stylesheet>/<resource>

Many confuse this step with URL rewriting or redirecting as the Tomcat Connector is often called the Jakarta Redirector.  Therefore, I choose to differentiate by saying that this delegates HTTP requests between the two systems and nothing more.

In every install, I have always used the defaults in the workers.properties file and just used the following rule in the uriworkermap.properties file:

/cps/*=wlb

URL Rewriting

Due to the effort of setting up delegation, deciding which HTTP requests should be forwarded to Delivery Server is a simple matter of performing some URL rewrites.

As we have decided to use a mature Web Server, there are best practice ways to achieve this.  In IIS6, HeliconTech (http://bit.ly/bgJEF6) created a very useful ISAPI filter which ports the widely adopted Apache mod_rewrite (http://bit.ly/cfvuLD) functionality.  For both of these, the same rewrite rules can be used.  The following provides a couple of typical examples:

# Default landing page redirect
RewriteRule ^/$ /cps/rde/xchg/<project>/<xsl_stylesheet>/index.htm [L]
# Rewrite to delegate all *.html or *.htm HTTP requests to Delivery Server
RewriteRule ^/?.*/(.+.html?)$ /cps/rde/xchg/<project>/<xsl_stylesheet>/$1 [L]
# Rewrite to delegate all *.xml HTTP requests to Delivery Server
RewriteRule ^/?.*/(.+.xml)$ /cps/rde/xchg/<project>/<xsl_stylesheet>/$1 [L]

Those of you who are well versed in regular expressions will see that the last two rules could be combined but I tend to leave them separate to aid readability.

The beauty of using regular expressions in this way is that you can actually create useful SEO benefits to your site also. Take for example the following rule:

RewriteRule ^/?.*/([0-9a-zA-Z_]+)$ /cps/rde/xchg/<project>/<xsl_stylesheet>/$1.htm [L]

This rule maps a URL with many apparent subdirectories to the Delivery Server file.  This means that you can publish a page with a “virtual” path within the Management Server which appears to a browser (and search engines) as something like the following:

http://<host>/this/is/a/descriptive/directory/structure/page.htm

and yet this maps to:

/cps/rde/xchg/<project>/<xsl_stylesheet>/page.htm

IIS7

Being a Microsoft product, IIS7 has some quirks with regards to the rewriting (of course), which I explained in a previous post: http://bit.ly/lp6zW.

Summary

This approach has led to many successful installations where sites could additionally be optimised for SEO and page load.