Using the XmlReader class with C#
Some of the project files created by Cyotek Sitemap Creator and WebCopy are fairly large, and the load performance of such files is poor. The files are saved using a XmlWriter class, which is nice and fast. When reading the files back, however, the whole file is currently loaded into a XmlDocument and XPath expressions are then used to pull out the values. This article describes our effort to convert the load code to use a XmlReader instead.
Sample XML
The following XML snippet can be used as a base for testing the code in this article, if required.
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<cyotek.webcopy.project version="1.0.0.0" generator="Cyotek WebCopy 1.0.0.2 (BETA))" edition="">
<uri lastCrawled="-8589156546443756722" includeSubDomains="false">http://saturn/cyotekdev/</uri>
<additionalUri>
<uri>first url</uri>
<uri>second url</uri>
</additionalUri>
<authentication doNotAskForPasswords="false">
<credential uri="/" userName="username" password="password" />
</authentication>
<saveFolder path="C:\Downloaded Web Sites" emptyBeforeCrawl="true" createFolderForDomain="true" flattenWebsiteDirectories="false" remapExtensions="true" />
<crawler removeFragments="true" followRedirects="true" disableUriRemapping="false" slashedRootRemapMode="1" sort="false" acceptDeflate="true" acceptGZip="true" bufferSize="0" crawlAboveRoot="false" />
<defaultDocuments />
<linkInfo save="true" clearBeforeCrawl="true" />
<stripQueryString>false</stripQueryString>
<useHeaderChecking>true</useHeaderChecking>
<userAgent useDefault="true"></userAgent>
<rules>
<rule options="1" enabled="true">trackback\?id=</rule>
<rule options="1" enabled="false">/downloads/get</rule>
<rule options="1" enabled="false">/article</rule>
<rule options="1" enabled="false">/sitemap</rule>
<rule options="1" enabled="false">image/get/</rule>
<rule options="1" enabled="false">products</rule>
<rule options="1" enabled="false">zipviewer</rule>
</rules>
<domainAliases>
<alias>(?:http(?:s?):\/\/)?saturn/cyotekdev/</alias>
</domainAliases>
<forms>
<page name="" uri="login" enabled="true" method="POST">
<parameters>
<parameter name="rememberMe">true</parameter>
<parameter name="username">username</parameter>
<parameter name="password">password</parameter>
</parameters>
</page>
</forms>
<linkMap>
<link id="b1b85626f9984279b5e033c30a0a3f65" uri="" source="1" contentType="text/html" httpStatus="200" lastDownloaded="-8589156550177150260" hash="0333961593BD555C49ABF2355140225A07DA9297" fileName="index.htm">
<title>Cyotek</title>
<incomingLinks>
<link id="b1b85626f9984279b5e033c30a0a3f65" />
</incomingLinks>
<outgoingLinks>
<link id="96a358d21135449eb6561f25399e24de" />
</outgoingLinks>
<headers>
<header key="Content-Encoding" value="gzip" />
<header key="Vary" value="Accept-Encoding" />
<header key="X-AspNetMvc-Version" value="1.0" />
<header key="Content-Length" value="3415" />
<header key="Cache-Control" value="private" />
<header key="Content-Type" value="text/html; charset=utf-8" />
<header key="Date" value="Fri, 01 Oct 2010 16:51:07 GMT" />
<header key="Expires" value="Fri, 01 Oct 2010 16:51:07 GMT" />
<header key="ETag" value="" />
<header key="Server" value="Microsoft-IIS/7.5" />
<header key="X-Powered-By" value="UrlRewriter.NET 2.0.0" />
</headers>
</link>
</linkMap>
</cyotek.webcopy.project>
Writing XML using a XmlWriter
Before I start discussing how to load the data, here is a quick overview of how it is originally saved. For clarity I'm only showing the bare bones of the method.
string workFile;
workFile = Path.GetTempFileName();
using (FileStream stream = File.Create(workFile))
{
  XmlWriterSettings settings;
  settings = new XmlWriterSettings { Indent = true, Encoding = Encoding.UTF8 };
  using (XmlWriter writer = XmlWriter.Create(stream, settings))
  {
    writer.WriteStartDocument(true);
    writer.WriteStartElement("uri");
    // the stock WriteAttributeString overloads only accept strings, so convert explicitly
    if (this.LastCrawled.HasValue)
      writer.WriteAttributeString("lastCrawled", XmlConvert.ToString(this.LastCrawled.Value.ToBinary()));
    writer.WriteAttributeString("includeSubDomains", XmlConvert.ToString(_includeSubDomains));
    writer.WriteValue(this.Uri);
    writer.WriteEndElement();
    writer.WriteEndDocument();
  }
}
// only replace the destination once the temporary file has been written successfully
File.Copy(workFile, fileName, true);
File.Delete(workFile);
The above code creates a new temporary file and opens it as a FileStream. A XmlWriterSettings object is created to specify some options (by default the writer won't indent, making the output files difficult to read if you open them in a text editor), and then a XmlWriter is created from both the settings and the stream.
Once you have a writer, you can quickly save data in a compliant format, with the caveat that you must ensure that each of your WriteStart calls has a corresponding WriteEnd, that you only have a single document element, and so on.
Assuming the writer gets to the end without any errors, the stream is closed, then the temporary file is copied to the final destination before being deleted. (This is a good tip in its own right, as it means you won't destroy the user's existing file if an error occurs, which you would if you wrote directly to the destination file.)
Reading XML using a XmlDocument
As discussed above, we currently use a XmlDocument to load data. The following snippet shows an example of this.
Note that the code below won't work "out of the box" as we use a number of extension methods to handle data type conversion, which makes the code a lot more readable!
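XmlDocument document;
XmlElement documentElement;
document = new XmlDocument();
document.Load(fileName);
documentElement = document.DocumentElement;
// AsString, AsDate and AsBoolean are our own extension methods
_uri = documentElement.SelectSingleNode("uri").AsString();
_lastCrawled = documentElement.SelectSingleNode("uri/@lastCrawled").AsDate();
_includeSubDomains = documentElement.SelectSingleNode("uri/@includeSubDomains").AsBoolean(false);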
So, as you can see, we load a XmlDocument with the contents of our file. We then call SelectSingleNode several times, each time with a different XPath expression.
And in the case of a crawler project, we do this a lot, as there is a large amount of information stored in the file.
I haven't tried to benchmark XPath, but I would assume that we could have optimized this by first getting the appropriate element (uri in this case) and then running additional XPath against it to read the text and attributes. But then this article would be rather pointless, as we want to discuss the XmlReader!
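For what it's worth, that untested optimization might look something like the sketch below. This isn't code from WebCopy: the UriElementLoader class and its Describe method are made up for illustration, and it reads the attributes with XmlElement.GetAttribute rather than a second round of XPath. I haven't measured whether it is actually any faster.

```csharp
using System;
using System.Xml;

public static class UriElementLoader
{
  // Looks up the uri element once, then reads its text and attributes
  // directly instead of evaluating a separate XPath expression per value.
  public static string Describe(string xml)
  {
    XmlDocument document = new XmlDocument();
    document.LoadXml(xml);

    XmlElement uriElement = (XmlElement)document.DocumentElement.SelectSingleNode("uri");
    if (uriElement == null)
      return null;

    // XmlElement.GetAttribute returns an empty string when the attribute is missing
    string lastCrawled = uriElement.GetAttribute("lastCrawled");
    string includeSubDomains = uriElement.GetAttribute("includeSubDomains");

    return uriElement.InnerText + "|" + lastCrawled + "|" + includeSubDomains;
  }

  public static void Main()
  {
    string xml = "<project><uri lastCrawled=\"123\" includeSubDomains=\"false\">http://saturn/cyotekdev/</uri></project>";
    Console.WriteLine(Describe(xml));
  }
}
```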
As an example, we have a 2MB project file which represents the development version of cyotek.com. Using System.Diagnostics.Stopwatch we timed how long it took to load this project 10 times, and it averaged 25 seconds per load, which is definitely unacceptable.
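The timing itself needs nothing clever. The sketch below shows roughly the kind of harness I mean; the LoadTimer class and AverageLoadTime method are invented names for this example, and the tiny temporary file simply stands in for a real project file, so the numbers it prints say nothing about WebCopy itself.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Xml;

public static class LoadTimer
{
  // Runs the supplied load action the requested number of times and
  // returns the average duration of a single run.
  public static TimeSpan AverageLoadTime(Action load, int iterations)
  {
    if (iterations <= 0)
      throw new ArgumentOutOfRangeException("iterations");

    Stopwatch stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
      load();
    stopwatch.Stop();

    return TimeSpan.FromTicks(stopwatch.Elapsed.Ticks / iterations);
  }

  public static void Main()
  {
    // A tiny stand-in document; a real test would point at a project file
    string fileName = Path.GetTempFileName();
    File.WriteAllText(fileName, "<project><uri>http://saturn/cyotekdev/</uri></project>");

    try
    {
      TimeSpan average = AverageLoadTime(() => new XmlDocument().Load(fileName), 10);
      Console.WriteLine("Average load time: {0} ms", average.TotalMilliseconds);
    }
    finally
    {
      File.Delete(fileName);
    }
  }
}
```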
Reading using a XmlReader
Which brings us to the point of this article: doing the job using a XmlReader and hopefully improving the performance dramatically.
Before we continue though, a caveat:
This is the first time I've tried to use the XmlReader class, therefore it is possible this article doesn't take the best approach. I also wrote this article at the same time as getting the reader to work in my application, so I've gone back and forth correcting errors and misconceptions, which at times left (and possibly still leaves) the article a little disjointed. If you spot any errors in this article, please let us know.
The XmlReader seems to operate on the same principle as the XmlWriter, in that you need to read the data in more or less the same order as it was written. I suppose the most convenient analogy is a forward cursor in SQL Server, where you can only move forward through the records and not back.
Creating the reader
So, first things first - we need to create an object. But the XmlReader (like the XmlWriter) is abstract. Fortunately, exactly like the writer, there is a static Create method we can use.
Continuing in the reader-is-just-like-writer vein, there is also a XmlReaderSettings class which you can use to fine tune certain aspects.
Let's get the document opened then. With XmlDocument we just provided a file name; here we hand the XmlReader a stream, although XmlReader.Create also has overloads that accept a file name or URI directly.
using (FileStream fileStream = File.OpenRead(fileName))
{
  XmlReaderSettings settings;
  settings = new XmlReaderSettings();
  settings.ConformanceLevel = ConformanceLevel.Document;
  using (XmlReader reader = XmlReader.Create(fileStream, settings))
  {
  }
}
This sets us up nicely. Continuing my analogy from earlier, if you're familiar with record sets, there's usually a MoveNext or a Read method you call to read the next record in the set. The XmlReader doesn't seem to be different in this respect, as there's a dedicated Read method for stepping through each node in the document. In addition, there are a number of other read methods for performing more specific actions.
There is also a NodeType property which lets you know what the current node type is, such as the start of an element, or the end of an element.
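To get a feel for what Read and NodeType report, a throwaway dumper along the lines of the sketch below can help. NodeDumper is a made up name for this example, and I've switched on IgnoreWhitespace purely to keep the output short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeDumper
{
  // Returns one "depth:type:name" entry for every node the reader visits.
  public static List<string> Dump(string xml)
  {
    List<string> nodes = new List<string>();
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.IgnoreWhitespace = true;

    using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
    {
      while (reader.Read())
        nodes.Add(reader.Depth + ":" + reader.NodeType + ":" + reader.Name);
    }

    return nodes;
  }

  public static void Main()
  {
    // defaultDocuments is an empty element, so no EndElement node is reported for it
    foreach (string node in Dump("<project><uri>first url</uri><defaultDocuments /></project>"))
      Console.WriteLine(node);
  }
}
```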
I'm going to use the IsStartElement method to work out if the current node is the start of an element, then perform processing based on the element name.
Enumerating elements, regardless of their position in the hierarchy
The following snippet will iterate all nodes and check to see if they are the start of an element. Note that this includes top level elements and child elements.
while (reader.Read())
{
if (reader.IsStartElement())
{
}
}
The Name property will return the name of the active node. So I'm going to compare the name against the names written into the XML and do custom processing for each.
switch (reader.Name)
{
case "uri":
break;
}
Reading attributes on the active element
I mentioned above that there are a number of Read* methods. There are also several Move* methods. The one that caught my eye is MoveToNextAttribute, which I'm going to use for converting attributes to property values.
The Value property will return the value of the current node. If MoveToNextAttribute returns true, then I know I'm on a valid attribute and I can use the aforementioned Name property and the Value property to update property assignments.
The following snippet demonstrates the MoveToNextAttribute method and Value property:
while (reader.MoveToNextAttribute())
{
  switch (reader.Name)
  {
    case "lastCrawled":
      if (!string.IsNullOrEmpty(reader.Value))
        _lastCrawled = DateTime.FromBinary(Convert.ToInt64(reader.Value));
      break;
    case "includeSubDomains":
      if (!string.IsNullOrEmpty(reader.Value))
        _includeSubDomains = Convert.ToBoolean(reader.Value);
      break;
  }
}
This is actually quite a lot of work. An alternative is to use the GetAttribute method - this reads an attribute value without moving the reader. I found this very handy when I was loading an object whose identifying property wasn't the first attribute in the XML block. It also takes up a lot less code:
entry.Headers.Add(reader.GetAttribute("key"), reader.GetAttribute("value"));
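To put that one liner into context, here is a small self-contained sketch that pulls header elements like the ones in the sample XML into a dictionary. The HeaderLoader class is a made up stand-in (the real entry.Headers collection belongs to WebCopy and isn't shown here), and the null checks reflect the fact that GetAttribute returns null for an attribute that isn't present.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class HeaderLoader
{
  // Collects the key/value attribute pairs of every header element.
  public static Dictionary<string, string> ReadHeaders(string xml)
  {
    Dictionary<string, string> headers = new Dictionary<string, string>();

    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
      while (reader.Read())
      {
        if (reader.IsStartElement() && reader.Name == "header")
        {
          // GetAttribute does not move the reader, so the order of the
          // attributes in the file doesn't matter
          string key = reader.GetAttribute("key");
          string value = reader.GetAttribute("value");

          if (key != null)
            headers[key] = value ?? string.Empty;
        }
      }
    }

    return headers;
  }

  public static void Main()
  {
    string xml = "<headers><header key=\"Vary\" value=\"Accept-Encoding\" /><header value=\"\" key=\"ETag\" /></headers>";
    foreach (KeyValuePair<string, string> pair in ReadHeaders(xml))
      Console.WriteLine("{0} = '{1}'", pair.Key, pair.Value);
  }
}
```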
Reading the content value of an element
I've now got two values out of hundreds in the file loaded and I'm finished with that element. Or am I? Actually I'm not - the original save code demonstrates that in addition to a pair of attributes, we're also saving data directly into the element.
As we have been iterating attributes, the active node is the last attribute, not the original element. Fortunately there's another method we can use - MoveToContent. This time though, we can't use the Value property. Instead, we'll call the ReadString method, giving us the following snippet:
if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
_uri = reader.ReadString();
I've included a call to IsStartElement in the above snippet as I found if I called MoveToContent when I was already on a content node (for example if no attributes were present), then it skipped the current node and moved to the next one.
If required, you can call ReadElementContentAsString instead of ReadString.
Some node values aren't strings though - in this case the XmlReader offers a number of strongly typed methods to return and convert the data for you, such as ReadElementContentAsBoolean, ReadElementContentAsDateTime, etc.
case "useHeaderChecking":
_useHeaderChecking = reader.ReadElementContentAsBoolean();
break;
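As a standalone illustration of the typed methods, the sketch below reads a single boolean element by name. TypedContentReader and ReadFlag are hypothetical names for this example, and I've used ReadToFollowing to find the element rather than the switch based loop used elsewhere in this article.

```csharp
using System;
using System.IO;
using System.Xml;

public static class TypedContentReader
{
  // Returns the boolean content of the first element with the given name,
  // or the default value when no such element exists.
  public static bool ReadFlag(string xml, string elementName, bool defaultValue)
  {
    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
      // ReadToFollowing advances the reader to the next matching element
      if (reader.ReadToFollowing(elementName))
        return reader.ReadElementContentAsBoolean();
    }

    return defaultValue;
  }

  public static void Main()
  {
    string xml = "<project><stripQueryString>false</stripQueryString><useHeaderChecking>true</useHeaderChecking></project>";
    Console.WriteLine(ReadFlag(xml, "useHeaderChecking", false));
    Console.WriteLine(ReadFlag(xml, "stripQueryString", true));
  }
}
```

One thing to keep in mind is that the ReadElementContentAs methods leave the reader positioned on the node after the closing tag, so inside a while (reader.Read()) loop the following Read call moves on again from there.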
Processing nodes where the same names are reused for different purposes
In the sample XML document at the start of this article, we have two different types of nodes named uri. The top level one has one purpose, and the children of additionalUri have another.
The problem we now face is that, as we have a single loop which processes all elements, the case statement for uri will be triggered multiple times. We're going to need some way of determining which is which.
There are a few ways we could do this, for example:
- Continue to use the main processing loop, just add a means of identifying which type of element is being processed
- Adding another loop to process the children of the additionalUri element
- Using the ReadSubtree method to create a brand new XmlReader containing the children and process that accordingly.
As we already have a loop which handles the elements we should probably reuse this - there'll be a lot of duplicate code if we suddenly start adding new loops.
Unfortunately there doesn't seem to be an equivalent of the parent functionality of the XmlDocument class; the closest thing I could see was the Depth property. This returned 1 for the top level uri node, and 2 for the child versions. You need to be careful at what point you read this property, as it also returned 2 when iterating the attributes of the top level uri node.
One workaround would be to use boolean flags to identify the type of node you are loading. This would also mean checking to see if the NodeType was XmlNodeType.EndElement, doing another name comparison, and resetting flags as appropriate. This might be more reliable (or understandable) than simply checking node depths; your mileage may vary.
Another alternative could be to combine depth and element start/end in order to push and pop a stack which would represent the current node hierarchy.
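I didn't go down this route myself, but a minimal sketch of the stack idea might look like the following. PathTracker and its methods are invented for this example; it pushes the element name on each start tag, pops it on each end tag, and uses the resulting path to tell the two kinds of uri node apart.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PathTracker
{
  // Returns "path=text" for the text of every uri element in the document.
  public static List<string> ReadUris(string xml)
  {
    List<string> results = new List<string>();
    Stack<string> path = new Stack<string>();

    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
      while (reader.Read())
      {
        switch (reader.NodeType)
        {
          case XmlNodeType.Element:
            // empty elements never produce an EndElement node, so don't push them
            if (!reader.IsEmptyElement)
              path.Push(reader.Name);
            break;
          case XmlNodeType.Text:
            if (path.Count > 0 && path.Peek() == "uri")
              results.Add(CurrentPath(path) + "=" + reader.Value);
            break;
          case XmlNodeType.EndElement:
            path.Pop();
            break;
        }
      }
    }

    return results;
  }

  // Builds a slash separated path from the outermost element inwards.
  private static string CurrentPath(Stack<string> path)
  {
    string[] names = path.ToArray(); // innermost element first
    Array.Reverse(names);
    return string.Join("/", names);
  }

  public static void Main()
  {
    string xml = "<project><uri>root</uri><additionalUri><uri>first url</uri><uri>second url</uri></additionalUri><defaultDocuments /></project>";
    foreach (string entry in ReadUris(xml))
      Console.WriteLine(entry);
  }
}
```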
In order to get my converted code running, I went with the boolean flag route. I suspect a future version of the crawler format is going to ensure the nodes have unique names so I don't have to do this hoop jumping again though!
Combined together, the load data code now looks like this:
while (reader.Read())
{
  if (reader.IsStartElement())
  {
    switch (reader.Name)
    {
      case "uri":
        if (!isLoadingAdditionalUris)
        {
          // top level uri element: attributes first, then the element content
          while (reader.MoveToNextAttribute())
          {
            switch (reader.Name)
            {
              case "lastCrawled":
                if (!string.IsNullOrEmpty(reader.Value))
                  _lastCrawled = DateTime.FromBinary(Convert.ToInt64(reader.Value));
                break;
              case "includeSubDomains":
                if (!string.IsNullOrEmpty(reader.Value))
                  _includeSubDomains = Convert.ToBoolean(reader.Value);
                break;
            }
          }
          if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
            _uri = reader.ReadString();
        }
        // child of additionalUri: same check as above, the content is read from a start element
        else if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
          _additionalRootUris.Add(new Uri(UriHelpers.CombineUri(this.GetBaseUri(), reader.ReadString(), this.SlashedRootRemapMode)));
        break;
      case "additionalUri":
        isLoadingAdditionalUris = true;
        break;
    }
  }
  else if (reader.NodeType == XmlNodeType.EndElement)
  {
    switch (reader.Name)
    {
      case "additionalUri":
        isLoadingAdditionalUris = false;
        break;
    }
  }
}
Which is significantly more code than the original version, and it's only handling a few values.
Using the ReadSubtree Method
The load and save functionality of crawler projects isn't centralized; child objects such as rules perform their own loading and saving via the following interface:
public interface IXmlPersistance
{
void Write(string fileName, XmlWriter writer);
void Read(string fileName, XmlNode reader);
}
And the current XmlDocument based code will call it like this:
_rules.Clear();
foreach (XmlNode child in documentElement.SelectNodes("rules/rule"))
{
Rule rule;
rule = new Rule();
((IXmlPersistance)rule).Read(fileName, child);
_rules.Add(rule);
}
None of this code will work now with the switch to XmlReader, so it all needs changing. For this, I'll create a new interface:
public interface IXmlPersistance2
{
void Write(string fileName, XmlWriter writer);
void Read(string fileName, XmlReader reader);
}
The only difference is that the Read method now takes a XmlReader rather than a XmlNode.
The next issue is that if I pass the original reader to this interface, the implementer will be able to read outside the boundaries of the element it is supposed to be reading, which could prevent the rest of the document from loading successfully.
We can resolve this particular issue by calling the ReadSubtree method, which returns a brand new XmlReader object that only contains the active element and its children. This means our other settings objects can happily (mis)use the passed reader without affecting the underlying load.
Note in the snippet below that we have wrapped the new reader in a using statement. The MSDN documentation states that the result of ReadSubtree should be closed before you continue reading from the original reader.
Rule rule;
rule = new Rule();
using (XmlReader childReader = reader.ReadSubtree())
((IXmlPersistance2)rule).Read(fileName, childReader);
_rules.Add(rule);
break;
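To show the boundary in action without the rest of WebCopy, here is a small self-contained sketch. SubtreeDemo is a made up class that stands in for the Rule loading above: each rule is read through its own subtree reader, which is deliberately read all the way to the end, yet the outer reader carries on with the next rule and never hands the alias text to the rule code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SubtreeDemo
{
  // Reads the text of every rule element via a separate subtree reader.
  public static List<string> ReadRules(string xml)
  {
    List<string> rules = new List<string>();

    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
      while (reader.Read())
      {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "rule")
        {
          using (XmlReader childReader = reader.ReadSubtree())
          {
            // reading until Read returns false cannot go past the end of this rule
            while (childReader.Read())
            {
              if (childReader.NodeType == XmlNodeType.Text)
                rules.Add(childReader.Value);
            }
          }
          // once the child is closed, the outer reader sits on the rule's end tag
        }
      }
    }

    return rules;
  }

  public static void Main()
  {
    string xml = "<project><rules><rule enabled=\"true\">/downloads/get</rule><rule enabled=\"false\">/article</rule></rules><domainAliases><alias>saturn</alias></domainAliases></project>";
    foreach (string rule in ReadRules(xml))
      Console.WriteLine(rule);
  }
}
```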
Getting a XmlDocument from a XmlReader
One of the issues I did have was with classes which extend the load behaviour of an existing class. For example, one abstract class has a number of base properties, which I easily converted to use XmlReader. However, this class is inherited by other classes which load additional properties. Using the loop method outlined above, it wasn't possible for these child classes to read their data as the reader had already been fully read. I didn't want these derived classes to have to load the base properties themselves, and I didn't want to implement any half thought out idea. So, instead, these classes continue to use the original XmlDocument based loading. So, given a XmlReader as a source, how do you get an XmlDocument?
Turns out this is also very simple - the Load method of the XmlDocument can accept a reader. The only disadvantage is that the constructor of the XmlDocument doesn't support this, which means you have to explicitly declare a document, load it, then pass it on, as demonstrated below.
void IXmlPersistance2.Read(string fileName, XmlReader reader)
{
  XmlDocument document;
  document = new XmlDocument();
  document.Load(reader);
  ((IXmlPersistance)this).Read(fileName, document.DocumentElement);
}
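If you want to try the same trick outside of WebCopy, the sketch below combines it with ReadSubtree so that only a single element ends up in the document. SubtreeToDocument and LoadElement are hypothetical names for this example and aren't part of the crawler code.

```csharp
using System;
using System.IO;
using System.Xml;

public static class SubtreeToDocument
{
  // Moves a reader to the first element with the given name and loads just
  // that element (and its children) into a XmlDocument.
  public static XmlDocument LoadElement(string xml, string elementName)
  {
    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
      if (!reader.ReadToFollowing(elementName))
        return null;

      XmlDocument document = new XmlDocument();
      using (XmlReader subtree = reader.ReadSubtree())
        document.Load(subtree);

      return document;
    }
  }

  public static void Main()
  {
    string xml = "<project><forms><page uri=\"login\"><parameters><parameter name=\"username\">username</parameter></parameters></page></forms></project>";
    XmlDocument document = LoadElement(xml, "page");

    // the usual XPath based code works against the loaded fragment
    Console.WriteLine(document.DocumentElement.GetAttribute("uri"));
    Console.WriteLine(document.DocumentElement.SelectSingleNode("parameters/parameter/@name").Value);
  }
}
```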
Fortunately these classes aren't used frequently and so they shouldn't adversely affect the performance tuning I'm trying to do.
I could have used the GetAttribute method I discussed earlier as this doesn't move the reader, but firstly I didn't discover that method until after I'd written this section of the article and I thought it had enough value to remain, and secondly I don't think there is an equivalent for elements.
The final verdict
Using the XmlReader is certainly long winded compared to the original code. The core of the original code is around 100 lines; the core of the new code is more than triple that. I'll probably replace all the "move to next attribute" loops with direct calls to GetAttribute, which will cut down the amount of code a fair bit. I may also try a generic approach using reflection, although this will then have its own performance drawback.
However, the XML load performance increase was certainly worth the extra code - the average went from 25 seconds down to 12 seconds. This is still quite slow and I certainly want to improve it further, but at less than half the original load time I'm pleased with the result.
You also need to be careful when writing the document. In the original Cyotek crawler project code, as we were using XPath to query an entire document, we could load values no matter where they were located. When using a XmlReader, the values are read in the same order as they were written - so if you have saved a critical piece of information near the end of the document, but you require it when loading information at the start, you're going to run into problems.
Update History
- 2010-11-05 - First published
- 2020-11-21 - Updated formatting
Comments
Calabonga
#
Thanks! The post is very helpful. XmlReader is not my choice!
Slobodan
#
It's worth noting the existence of the XPathNavigator class that in some respects behaves like a "reader" but can also be used for direct access to "named" locations using XPath.
With careful planning of the XML layout (to place the elements in a natural order) you can achieve the performance of the "reader" and the conciseness of the XmlDocument.
Richard Moss
#
Slobodan,
Thanks for your comments, I shall check into the XPathNavigator class too and see what this has to offer.
Regards; Richard Moss