Transforming hyperlinks when copying websites
Recently a website I infrequently use was badly defaced, and in the course of repairing the damage the owners of the site temporarily took it down. As I found it to be a very useful resource I lamented not having an offline copy and so when the site was restored, I decided to make a copy without further ado.
However, as I swiftly discovered, there was a problem - the site used JavaScript for many internal links, and WebCopy doesn't support JavaScript. Fortunately, when I looked at how the JavaScript links functioned, I discovered they were all of a predictable nature - a call to a single function with two string arguments. The destination URL was a simple concatenation of these arguments with no extra processing.
Although WebCopy is our most popular product, it actually got started accidentally as an offshoot of Sitemap Creator, used to make a copy of a long-forgotten website. The reason for bringing up that trivia is that right from the start, Sitemap Creator had a feature where it could transform page titles to remove the extra text these typically have. This functionality has now been re-purposed to allow WebCopy to intercept a URI at the detection stage and transform it into something different.
What can you use it for?
The initial use case is to transform values from one form into another in a predictable fashion, for example to remove a call to an interim page or to handle very simple JavaScript.
How do you use it?
You can find the configuration settings in the URI Transforms section of the Project Properties dialog.
Each replacement consists of a Pattern, a Replacement and an optional URI. The pattern is a regular expression which is used both to match the source link and to define any result groups. Replacement is another expression which defines how the URI is transformed. Finally, URI can be used to restrict the pattern matching to links belonging to a given URI.
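To make the three fields concrete, here is a minimal Python sketch of how such a transform could behave. The function and its behaviour are illustrative assumptions only, not WebCopy's actual implementation; note also that Python's `re.sub` writes the first capture group as `\1` where WebCopy's replacement syntax uses `$1`.

```python
import re

def apply_transform(link, pattern, replacement, uri=None):
    """Illustrative sketch only - not WebCopy's real implementation.

    Applies 'replacement' to 'link' wherever 'pattern' matches,
    but only if the link belongs to the optional 'uri' scope.
    """
    if uri is not None and not link.startswith(uri):
        return link  # link is outside the URI scope; leave it untouched
    return re.sub(pattern, replacement, link)

# Python uses \1 for the first capture group where WebCopy uses $1
print(apply_transform("redirecttracker.php?url=final.php",
                      r"redirecttracker\.php\?url=(.*)",
                      r"\1"))  # final.php
```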
Regular expressions are a vast and complicated topic and it would be nice if WebCopy didn't depend so much on them - they don't make WebCopy very easy to use in many respects. WebCopy does include a basic editor for expressions which can be quite handy for testing patterns and replacements but it could be improved.
Usage Scenario: Cutting out the middle man
One use case is for cutting out an interim page. For example, one page may ultimately link to another, but it does this by first calling an interim page with a query string argument describing the destination. The interim page will perform some action (such as logging the "click", showing a timed advert, etc.) and then navigate to the destination. By using a transform, we can manipulate the URL to discard the interim page and just go directly to the destination, remapping the source link appropriately.
The WebCopy demonstration page includes an example of this behaviour. The Middleman Redirect link will navigate to `redirecttracker.php?url=uritransformfinal.php`. We can use a simple pattern to strip out the bulk of the URL and just keep the query string parameter.
- Pattern: `redirecttracker\.php\?url=(.*)`
- Replacement: `$1`
This pattern matches `redirecttracker.php?url=` and captures everything after it in a group. The replacement then simply outputs the contents of the group, `uritransformfinal.php` in the above example.
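As a quick sanity check, the same match and replace can be reproduced with Python's standard `re` module (illustrative only; Python writes the first capture group as `\1` rather than the `$1` used in WebCopy's replacement field):

```python
import re

link = "redirecttracker.php?url=uritransformfinal.php"
pattern = r"redirecttracker\.php\?url=(.*)"

# \1 in Python corresponds to $1 in WebCopy's replacement syntax
result = re.sub(pattern, r"\1", link)
print(result)  # uritransformfinal.php
```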
If you tell WebCopy to crawl the demonstration site without the above transform, it will find `uritransformfinal.php`, but the source page will still point to the original redirection page. With the above transform in place, WebCopy will never know that `redirecttracker.php` exists - it will skip directly to the final page.
Usage Scenario: Converting simple JavaScript links
For a more advanced example, the demonstration page also has three hyperlinks with the following `href` attributes:

- `javascript:openPage('1', 'index')`
- `javascript:openPage('1', 'second')`
- `javascript:openPage('2', 'index')`
Clicking the first link will navigate to `1-index.php`, the second to `1-second.php` and the third to `2-index.php`. While not really best practice for modern websites, this mirrors the behaviour of the original site I wanted to copy.
If you do a normal scan using WebCopy, while it will detect all three links above, it will silently ignore them. To get WebCopy to correctly process these links, we need to detect the calls to the `openPage` function and construct a replacement URI using the two parameters, plus an extension.
This can be done with the following transform:

- Pattern: `javascript:openPage\('(.*)',\s?'(.*)'\)`
- Replacement: `$1-$2.php`
The above expression will first try to match `javascript:openPage('` (the parentheses are escaped with `\` as they are special characters). It will then capture any characters between the first set of single quotes into a capture group. After the closing quote, it will then match a `,` character. The `\s` token matches any white space character, and the `?` makes it optional. Next, the pattern captures any characters between the second set of single quote characters and matches a closing parenthesis. In fairness, the pattern could be simplified further, but then it would look even more confusing to newcomers, so I've tried to keep it more explicit.
The replacement expression combines the two groups (the `$1` and `$2` tokens represent capture groups from the pattern) with a `-` between them and then adds the `.php` extension.
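The full transform can be checked the same way in a Python sketch (for illustration only, with `\1` and `\2` standing in for WebCopy's `$1` and `$2` tokens):

```python
import re

pattern = r"javascript:openPage\('(.*)',\s?'(.*)'\)"
links = [
    "javascript:openPage('1', 'index')",
    "javascript:openPage('1', 'second')",
    "javascript:openPage('2', 'index')",
]

# \1 and \2 correspond to WebCopy's $1 and $2 replacement tokens
results = [re.sub(pattern, r"\1-\2.php", link) for link in links]
print(results)  # ['1-index.php', '1-second.php', '2-index.php']
```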
Now when scanning the demonstration website, WebCopy will find those three links and automatically transform them, therefore finding and downloading the linked pages. At the end of the copy, when WebCopy remaps downloaded HTML to ensure links are local to the copy, it will also replace the source links with the transformed name.
Getting the build
Currently this functionality is only available in nightly builds, available from the WebCopy download page.
Update History
- 2017-05-29 - First published
- 2020-11-23 - Updated formatting