Using Classic ASP and URL Rewrite for Dynamic SEO Functionality

I had another interesting situation present itself recently that I thought would make a good blog: how to use Classic ASP with the IIS URL Rewrite module to dynamically generate Robots.txt and Sitemap.xml files.

Overview

Here's the situation: I host a website for one of my family members, and like everyone else on the Internet, he wanted some better SEO rankings. We discussed a few things that he could do to improve his visibility with search engines, and one of the suggestions that I gave him was to keep his Robots.txt and Sitemap.xml files up-to-date. But there was an additional caveat - he uses two separate DNS names for the same website, and that presents a problem for absolute URLs in either of those files. Before anyone points out that it's usually not a good idea to host multiple DNS names on the same content, there are times when this is acceptable; for example, if you are trying to decide which of several DNS names is the best to use, you might want to bind each name to the same IP address and parse your logs to find out which address is getting the most traffic.

In any event, the syntax for both Robots.txt and Sitemap.xml files is pretty easy, so I wrote a couple of simple Classic ASP Robots.asp and Sitemap.asp pages that output the correct syntax and DNS-specific URLs for each domain name, and I wrote some simple URL Rewrite rules that rewrite inbound requests for Robots.txt and Sitemap.xml files to the ASP pages, while blocking direct access to the Classic ASP pages themselves.

All of that being said, there are a couple of quick things that I would like to mention before I get to the code:

  • First of all, I chose Classic ASP for the files because it allows the code to run without having to load any additional framework; I could have used ASP.NET or PHP just as easily, but either of those would require additional overhead that isn't really required.
  • Second, the specific website for which I wrote these specific examples consists of all static content that is updated a few times a month, so I wrote the example to parse the physical directory structure for the website's URLs and specified a weekly interval for search engines to revisit the website. All of these options can easily be changed; for example, I reused this code a little while later for a website where all of the content was created dynamically from a database, and I updated the code in the Sitemap.asp file to create the URLs from the dynamically-generated content. (That's really easy to do, but outside the scope of this blog.)

That being said, let's move on to the actual code.

Creating the Required Files

There are three files that you will need to create for this example:

  1. A Robots.asp file to which URL Rewrite will send requests for Robots.txt
  2. A Sitemap.asp file to which URL Rewrite will send requests for Sitemap.xml
  3. A Web.config file that contains the URL Rewrite rules

Step 1 - Creating the Robots.asp File

You need to save the following code sample as Robots.asp in the root of your website; this page will be executed whenever someone requests the Robots.txt file for your website. This example is very simple: it checks for the requested hostname and uses that to dynamically create the absolute URL for the website's Sitemap.xml file.

<%
    Option Explicit
    On Error Resume Next
    
    Dim strUrlRoot
    Dim strHttpHost
    Dim strUserAgent

    Response.Clear
    Response.Buffer = True
    Response.ContentType = "text/plain"
    Response.CacheControl = "public"

    Response.Write "# Robots.txt" & vbCrLf
    Response.Write "# For more information on this file see:" & vbCrLf
    Response.Write "# http://www.robotstxt.org/" & vbCrLf & vbCrLf

    strHttpHost = LCase(Request.ServerVariables("HTTP_HOST"))
    strUserAgent = LCase(Request.ServerVariables("HTTP_USER_AGENT"))
    strUrlRoot = "http://" & strHttpHost

    Response.Write "# Define the sitemap path" & vbCrLf
    Response.Write "Sitemap: " & strUrlRoot & "/sitemap.xml" & vbCrLf & vbCrLf

    Response.Write "# Make changes for all web spiders" & vbCrLf
    Response.Write "User-agent: *" & vbCrLf
    Response.Write "Allow: /" & vbCrLf
    Response.Write "Disallow: " & vbCrLf
    Response.End
%>

Step 2 - Creating the Sitemap.asp File

The following example file is also pretty simple, and you would save this code as Sitemap.asp in the root of your website. There is a section in the code where it loops through the file system looking for files with the *.html file extension and only creates URLs for those files. If you want other files included in your results, or you want to change the code from static to dynamic content, this is where you would need to update the file accordingly.

<%
    Option Explicit
    On Error Resume Next
    
    Response.Clear
    Response.Buffer = True
    Response.AddHeader "Connection", "Keep-Alive"
    Response.CacheControl = "public"
    
    Dim strFolderArray, lngFolderArray
    Dim strUrlRoot, strPhysicalRoot, strFormat
    Dim strUrlRelative, strExt

    Dim objFSO, objFolder, objFile

    strPhysicalRoot = Server.MapPath("/")
    Set objFSO = Server.CreateObject("Scripting.Filesystemobject")
    
    strUrlRoot = "http://" & Request.ServerVariables("HTTP_HOST")
    
    ' Check for XML or TXT format.
    If UCase(Trim(Request("format")))="XML" Then
        strFormat = "XML"
        Response.ContentType = "text/xml"
    Else
        strFormat = "TXT"
        Response.ContentType = "text/plain"
    End If

    ' Add the UTF-8 Byte Order Mark.
    Response.Write Chr(CByte("&hEF"))
    Response.Write Chr(CByte("&hBB"))
    Response.Write Chr(CByte("&hBF"))
    
    If strFormat = "XML" Then
        Response.Write "<?xml version=""1.0"" encoding=""UTF-8""?>" & vbCrLf
        Response.Write "<urlset xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">" & vbCrLf
    End if
    
    ' Always output the root of the website.
    Call WriteUrl(strUrlRoot,Now,"weekly",strFormat)

    ' --------------------------------------------------
    ' This following section contains the logic to parse
    ' the directory tree and return URLs based on the
    ' static *.html files that it locates. This is where
    ' you would change the code for dynamic content.
    ' -------------------------------------------------- 
    strFolderArray = GetFolderTree(strPhysicalRoot)

    For lngFolderArray = 1 to UBound(strFolderArray)
        strUrlRelative = Replace(Mid(strFolderArray(lngFolderArray),Len(strPhysicalRoot)+1),"\","/")
        Set objFolder = objFSO.GetFolder(Server.MapPath("." & strUrlRelative))
        For Each objFile in objFolder.Files
            strExt = objFSO.GetExtensionName(objFile.Name)
            If StrComp(strExt,"html",vbTextCompare)=0 Then
                If StrComp(Left(objFile.Name,6),"google",vbTextCompare)<>0 Then
                    Call WriteUrl(strUrlRoot & strUrlRelative & "/" & objFile.Name, objFile.DateLastModified, "weekly", strFormat)
                End If
            End If
        Next
    Next

    ' --------------------------------------------------
    ' End of file system loop.
    ' --------------------------------------------------     
    If strFormat = "XML" Then
        Response.Write "</urlset>"
    End If
    
    Response.End

    ' ======================================================================
    '
    ' Outputs a sitemap URL to the client in XML or TXT format.
    ' 
    ' tmpStrFreq = always|hourly|daily|weekly|monthly|yearly|never 
    ' tmpStrFormat = TXT|XML
    '
    ' ======================================================================

    Sub WriteUrl(tmpStrUrl,tmpLastModified,tmpStrFreq,tmpStrFormat)
        On Error Resume Next
        Dim tmpDate : tmpDate = CDate(tmpLastModified)
        ' Check if the request is for XML or TXT and return the appropriate syntax.
        If tmpStrFormat = "XML" Then
            Response.Write " <url>" & vbCrLf
            Response.Write " <loc>" & Server.HtmlEncode(tmpStrUrl) & "</loc>" & vbCrLf
            Response.Write " <lastmod>" & Year(tmpLastModified) & "-" & Right("0" & Month(tmpLastModified),2) & "-" & Right("0" & Day(tmpLastModified),2) & "</lastmod>" & vbCrLf
            Response.Write " <changefreq>" & tmpStrFreq & "</changefreq>" & vbCrLf
            Response.Write " </url>" & vbCrLf
        Else
            Response.Write tmpStrUrl & vbCrLf
        End If
    End Sub

    ' ======================================================================
    '
    ' Returns a string array of folders under a root path
    '
    ' ======================================================================

    Function GetFolderTree(strBaseFolder)
        Dim tmpFolderCount,tmpBaseCount
        Dim tmpFolders()
        Dim tmpFSO,tmpFolder,tmpSubFolder
        ' Define the initial values for the folder counters.
        tmpFolderCount = 1
        tmpBaseCount = 0
        ' Dimension an array to hold the folder names.
        ReDim tmpFolders(1)
        ' Store the root folder in the array.
        tmpFolders(tmpFolderCount) = strBaseFolder
        ' Create file system object.
        Set tmpFSO = Server.CreateObject("Scripting.Filesystemobject")
        ' Loop while we still have folders to process.
        While tmpFolderCount <> tmpBaseCount
            ' Set up a folder object to a base folder.
            Set tmpFolder = tmpFSO.GetFolder(tmpFolders(tmpBaseCount+1))
              ' Loop through the collection of subfolders for the base folder.
            For Each tmpSubFolder In tmpFolder.SubFolders
                ' Increment the folder count.
                tmpFolderCount = tmpFolderCount + 1
                ' Increase the array size
                ReDim Preserve tmpFolders(tmpFolderCount)
                ' Store the folder name in the array.
                tmpFolders(tmpFolderCount) = tmpSubFolder.Path
            Next
            ' Increment the base folder counter.
            tmpBaseCount = tmpBaseCount + 1
        Wend
        GetFolderTree = tmpFolders
    End Function
%>

Note: There are two helper methods in the preceding example that I should call out:

  • The GetFolderTree() function returns a string array of all the folders that are located under a root folder; you could remove that function if you were generating all of your URLs dynamically.
  • The WriteUrl() function outputs an entry for the sitemap file in either XML or TXT format, depending on the file type that is in use. It also allows you to specify the frequency that the specific URL should be indexed (always, hourly, daily, weekly, monthly, yearly, or never).

Step 3 - Creating the Web.config File

The last step is to add the URL Rewrite rules to the Web.config file in the root of your website. The following example is a complete Web.config file, but you could merge the rules into your existing Web.config file if you have already created one for your website. These rules are pretty simple, they rewrite all inbound requests for Robots.txt to Robots.asp, and they rewrite all requests for Sitemap.xml to Sitemap.asp?format=XML and requests for Sitemap.txt to Sitemap.asp?format=TXT; this allows requests for both the XML-based and text-based sitemaps to work, even though the Robots.txt file contains the path to the XML file. The last part of the URL Rewrite syntax returns HTTP 404 errors if anyone tries to send direct requests for either the Robots.asp or Sitemap.asp files; this isn't absolutely necesary, but I like to mask what I'm doing from prying eyes. (I'm kind of geeky that way.)

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.webServer>
    <rewrite>
      <rewriteMaps>
        <clear />
        <rewriteMap name="Static URL Rewrites">
          <add key="/robots.txt" value="/robots.asp" />
          <add key="/sitemap.xml" value="/sitemap.asp?format=XML" />
          <add key="/sitemap.txt" value="/sitemap.asp?format=TXT" />
        </rewriteMap>
        <rewriteMap name="Static URL Failures">
          <add key="/robots.asp" value="/" />
          <add key="/sitemap.asp" value="/" />
        </rewriteMap>
      </rewriteMaps>
      <rules>
        <clear />
        <rule name="Static URL Rewrites" patternSyntax="ECMAScript" stopProcessing="true">
          <match url=".*" ignoreCase="true" negate="false" />
          <conditions>
            <add input="{Static URL Rewrites:{REQUEST_URI}}" pattern="(.+)" />
          </conditions>
          <action type="Rewrite" url="{C:1}" appendQueryString="false" redirectType="Temporary" />
        </rule>
        <rule name="Static URL Failures" patternSyntax="ECMAScript" stopProcessing="true">
          <match url=".*" ignoreCase="true" negate="false" />
          <conditions>
            <add input="{Static URL Failures:{REQUEST_URI}}" pattern="(.+)" />
          </conditions>
          <action type="CustomResponse" statusCode="404" subStatusCode="0" />
        </rule>
        <rule name="Prevent rewriting for static files" patternSyntax="Wildcard" stopProcessing="true">
          <match url="*" />
          <conditions>
             <add input="{REQUEST_FILENAME}" matchType="IsFile" />
          </conditions>
          <action type="None" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>

Summary

That sums it up for this blog; I hope that you get some good ideas from it.

For more information about the syntax in Robots.txt and Sitemap.xml files, see the following URLs:


Note: This blog was originally posted at http://blogs.msdn.com/robert_mcmurray/

Bad Characters to Use in Web-based Filenames

My good friend Wade Hilmo recently posted an excellent blog titled "How IIS blocks characters in URLs" that discusses some of the internal workings for how IIS deals with several characters in file names that do not work well in URLs. Wade’s blog does a great job explaining all of the internal IIS URL parsing logic in detail, and his post reminded me of some related notes that I had published on an internal website at Microsoft. As a complement to Wade’s outstanding blog post, I’m reposting my notes in this blog.

Recently a Microsoft Technical Account Manager (TAM) mentioned that he was working on an issue with a customer that was using SharePoint 2007 on IIS 7. The customer's problem was this: his company had several Word documents that were stored in SharePoint that had the plus sign (+) in the filenames, and HTTP requests for these documents were failing. The TAM configured IIS 7's request filtering feature to allow doubly-escaped characters by setting the allowDoubleEscaping attribute to true. This seemed to alleviate the problem, but I had to point out that this probably wasn't the right thing to do. As a general rule, I don't like changing many of the default configuration options for the IIS 7 request filtering feature, because they are designed to keep my servers safe. But in this specific scenario, modifying those settings is simply looking in the wrong place.

Let me explain:

There are several characters that are perfectly valid in a Windows filename that are really bad when you post those files on websites, and either the server or the client could wreak havoc with them. In most scenarios the HTTP requests will receive an HTTP 404 File Not Found error, but in some cases that might cause an HTTP 400 Bad Request error. As such, even though you might find a way to work around the problem, it's a really bad idea to use those characters when you are posting files to a website.

RFC 2396 is the governing document for "Uniform Resource Identifiers (URI): Generic Syntax." This RFC defines what can and can't be used in a URI, as well as what shouldn't be used.

First, section "2.2. Reserved Characters" contains the following list of reserved characters:

reserved = ";" | "/" | "?" | ":" | "@" |
           "&" | "=" | "+" | "$" | ","

Second, section "2.4.3. Excluded US-ASCII Characters" contains the following lists of delimiter and unwise characters:

delims = "<" | ">" | "#" | "%" | <">

unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Several of the characters in those lists cannot be used in Windows filenames, but the remaining characters should not be used for filenames if you intend to upload those files to a website.

Here are my explanations for why some of those characters will cause problems if you attempt to use them in filenames that you upload to a website:

  • Plus Sign (+) - this character is often translated as a URI-encoded space, so the URI "http://localhost/foo+bar.doc" could be misinterpreted as the URI "http://localhost/foo bar.doc".
  • Percent Sign (%) - this character is used for URI escaping, and I've seen this cause a lot of problems because the two characters that follow the percent sign are assumed to be hex digits for an escaped ASCII code, so the URI "http://localhost/foo%bar.doc" could be misinterpreted as the URI "http://localhost/fooºr.doc".
  • Number/Pound Sign (#) - this character is used to delimit a URI from a fragment identifier (aka bookmarks), so the URI "http://localhost/foo#bar.doc" could be misinterpreted as the URI "http://localhost/foo" with a bookmark of "bar.doc".

So once again, just because you might be able to get this to work on your server doesn't mean that you should be using a character in a web-based filename that's reserved for something else. It's like building an atomic bomb - just because you can doesn't mean that you should. Your best answer in this scenario is to rename your files to something else.

;-]


Note: This blog was originally posted at http://blogs.msdn.com/robert_mcmurray/