Differentiate between validating and non-validating parsers
Due to the structure of UTF-8, there is no need to decode UTF-8 byte sequences unless you're looking for specific non-ASCII characters, because in valid UTF-8 streams all bytes below 128 are standalone ASCII characters (i.e. a byte below 128 is never part of a multi-byte sequence).[^fnconformunicode]

[^fnconformunicode]: Note that conforming XML parsers are required to reject certain Unicode codepoints.
Pugixml sacrifices this analysis for increased performance.
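
The UTF-8 property described above can be illustrated with a minimal sketch. The function name `find_tag_start` is hypothetical and is not part of pugixml; the sketch only assumes that the input is valid UTF-8. Because every byte of a multi-byte sequence is 128 or above, a plain byte comparison against `'<'` cannot produce a false match inside a non-ASCII character, so no decoding is needed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not pugixml API): returns the byte offset of the
// first '<' in a UTF-8 buffer, or std::string::npos if there is none.
// Assumes valid UTF-8: all bytes of multi-byte sequences are >= 0x80,
// so comparing raw bytes against an ASCII delimiter is safe.
std::size_t find_tag_start(const std::string& utf8)
{
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        if (utf8[i] == '<')
            return i;
    }
    return std::string::npos;
}
```

For example, in the input `"caf\xC3\xA9 <b>"` the two bytes that encode "é" are skipped like any other non-matching byte, and the scan stops at the `'<'` that follows.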
memory management algorithms are widely applicable beyond parsers).
Since there are several substantially different approaches to XML parsing, and the parser has to do additional processing that even people familiar with XML do not know about, it is important to outline the entire task at hand first, before diving into implementation details.
As mentioned before, parsers traditionally use lexers to convert the character stream into a token stream.
This can improve performance in cases where a parser has to do a lot of backtracking, but for XML parsers a lexer stage is just an extra layer of complexity that increases the per-character overhead.
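
As a rough illustration of parsing without a separate lexer stage, the sketch below recognizes an opening tag directly from the character stream, with no intermediate token objects. The names `is_name_char` and `parse_open_tag` are hypothetical, the accepted name characters are a deliberately simplified subset of what XML allows, and the input is assumed to be null-terminated. This is not pugixml's actual code.

```cpp
#include <string>

// Hypothetical helper: characters this sketch accepts in a tag name.
// Real XML name rules are broader (':' and '.', non-ASCII characters, etc.).
static bool is_name_char(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_' || c == '-';
}

// Parses an opening tag such as "<node>" straight from the characters.
// On success stores the tag name and returns a pointer just past '>'.
// Returns nullptr if the input does not start with a well-formed tag.
const char* parse_open_tag(const char* s, std::string& name)
{
    if (*s != '<')
        return nullptr;

    const char* begin = ++s;

    // The terminating '\0' is not a name character, so the loop
    // cannot run past the end of a null-terminated buffer.
    while (is_name_char(*s))
        ++s;

    if (s == begin || *s != '>')
        return nullptr;

    name.assign(begin, s);
    return s + 1;
}
```

Each character is inspected once and the decision about what it means is made on the spot, which is the per-character saving the paragraph above refers to.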
XML is a compromise between parsing performance, human readability, and parsing code complexity; a fast XML parser therefore makes XML a more attractive choice as the underlying format for an application's data model.
For performance purposes, "production ready" mainly means resistance to malformed data.
Sacrificing buffer overrun checks to improve performance is not feasible.
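
To make the point about malformed data concrete, here is a minimal sketch of a scanning loop that keeps its bounds check. The function `find_closing_quote` and its explicit `end` pointer are assumptions made for illustration; they are not taken from pugixml, which may guard against overruns differently.

```cpp
// Hypothetical helper: searches [cur, end) for the closing quote of an
// attribute value. Returns a pointer to the quote, or nullptr if the
// buffer ends first (malformed input such as an unterminated value).
const char* find_closing_quote(const char* cur, const char* end, char quote)
{
    while (cur != end) // the bounds check that must not be removed
    {
        if (*cur == quote)
            return cur;
        ++cur;
    }
    return nullptr;
}
```

Dropping the `cur != end` test would make the loop marginally cheaper on well-formed documents, but a truncated or hostile document would then cause reads past the end of the buffer.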
This chapter describes various performance tricks that allowed the author to write a very high-performing C++ parser, pugixml (TODO REF 1 pugixml).
While the techniques were used for an XML parser, most of them can be applied to parsers of other formats or even unrelated software (i.e.
[^fndoctypes]: Document type (DOCTYPE) declarations are parsed but their contents are ignored.