All about XML

Introduction

?

xpath

namespace

xslt

sax

pykml.parser vs. lxml parser?

tag

element = start-tag + end-tag, or empty-element tag

attribute = markup construct consisting of a name–value pair that exists within a start-tag or empty-element tag

node, element, attribute

<root>

    <child name="child1" />

</root>

child1 = attribute value

name = attribute name

child = element

root = root element (every element is also a node)
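A minimal sketch, using Python's standard xml.etree.ElementTree, of how these terms map onto a parsed tree. Writing <child> as an empty-element tag is an assumption made here so that the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

# Same shape as the snippet above; <child> is written as an
# empty-element tag so that the document is well-formed.
doc = '<root><child name="child1" /></root>'

root = ET.fromstring(doc)      # the root element
child = root[0]                # its first (and only) child element

print(root.tag)                # element name of the root
print(child.tag)               # element name of the child
print(child.attrib)            # attributes as a name -> value mapping
print(child.get("name"))       # value of the "name" attribute
```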

ElementTree, a DOM-like API, is available here: http://effbot.org/zone/element-index.htm

lxml, which implements the ElementTree API and also provides XSLT, XPath and more, is available here: http://codespeak.net/lxml/.

https://en.wikipedia.org/wiki/XML

XML parser = processor. Strings of characters that are not markup are content.

The XML specification defines a valid XML document as a well-formed XML document that also conforms to the rules of a Document Type Definition (DTD): it contains a reference to a DTD, and its elements and attributes are declared in that DTD and follow the grammatical rules the DTD specifies. The DTD, inherited from SGML, is the oldest schema language for XML.
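To make the difference concrete, here is a small sketch that checks well-formedness only, using Python's standard xml.etree.ElementTree. The helper name is_well_formed is made up for this example. ElementTree does not validate against a DTD, so checking validity would need a validating parser such as lxml.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML (well-formed), else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<root><child/></root>"))   # properly nested
print(is_well_formed("<root><child></root>"))    # <child> is never closed
```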

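Existing APIs for XML processing tend to fall into these categories: stream-oriented APIs (such as SAX), tree-traversal APIs (such as DOM), XML data binding, and declarative transformation languages (such as XSLT and XQuery).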

Stream-oriented facilities require less memory and, for certain tasks based on a linear traversal of an XML document, are faster and simpler than other alternatives. Tree-traversal and data-binding APIs typically require the use of much more memory, but are often found more convenient for use by programmers; some include declarative retrieval of document components via the use of XPath expressions.

XSLT is designed for declarative description of XML document transformations, and has been widely implemented both in server-side packages and Web browsers. XQuery overlaps XSLT in its functionality, but is designed more for searching of large XML databases.

Document Object Model (DOM) is an API that allows for navigation of the entire document as if it were a tree of node objects representing the document's contents.

For Python: pyKML, SimpleKML, fastkml ("If available, lxml will be used to increase its speed.")

XML-RPC

This is an easier alternative to SOAP for building web services, i.e. writing routines accessible from a remote host through the HTTP protocol, which carries XML-formatted data. Take into consideration the latency of calling routines over a network link, especially over a WAN link like the Internet, and also the fact that the code that makes up the routines must be interpreted each time the routine is called; take a look at JIT compilers and cache managers to lower the cost.

Here's what a sample XML-RPC call looks like (as copied from here):

POST xmlrpcexample.php HTTP/1.0
User-Agent: xmlrpc-epi-php/0.2 (PHP)
Host: localhost:80
Content-Type: text/xml
Content-length: 191

<?xml version='1.0' encoding="iso-8859-1" ?>
<methodCall>
<methodName>greeting</methodName>
<params>
 <param>
  <value>
   <string>Lucas</string>
  </value>
 </param>
</params>
</methodCall>  

Some solutions to access a web service through XML-RPC from VB and PHP code:

PHP

COM

Here's a sample that worked using IXR_Library.inc.php (web server) and vbXMLRPC (VB5 client):

//Server : /xmlrpc/timesrv.php
<?php
include('IXR_Library.inc.php');
 
function getTime($args) {
        return "Yes! Ain't XML-RPC cool && dandy?";
}
 
$server = new IXR_Server(array('test.getTime' => 'getTime'));
?>
 
//Client
Dim linsRequest As New XMLRPCRequest
Dim linsResponse As XMLRPCResponse
Dim linsUtility As New XMLRPCUtility
 
linsRequest.HostName = "localhost"
linsRequest.HostPort = 80
linsRequest.HostURI = "/xmlrpc/timesrv.php"
linsRequest.MethodName = "test.getTime"
 
Set linsResponse = linsRequest.Submit
Label1.Caption = linsResponse.Params(1).StringValue
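For comparison, a short Python sketch (standard xmlrpc.client module) that builds the same kind of methodCall payload as the greeting request shown earlier, then parses it back. No network is involved, so it illustrates the wire format only, not the PHP/VB setup above.

```python
import xmlrpc.client

# Serialize a call to a method named "greeting" with one string parameter.
request_body = xmlrpc.client.dumps(("Lucas",), methodname="greeting")
print(request_body)

# Parse it back, as a server would, to recover the parameters and method name.
params, method = xmlrpc.client.loads(request_body)
print(method, params)
```

To actually send such a call, xmlrpc.client.ServerProxy takes the URL of the endpoint and exposes remote methods as attributes; the URL and method names depend on the server and are not assumed here.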

Parsing with XPath

Note: XPath 2.0 came out in 2007, 3.0 in 2014, and 3.1 in 2017.

Command-line utilities

Two command-line utilities to work with XML files, using XQuery, XPath (a subset of XQuery), or CSS selectors.

XMLStarlet

https://en.wikipedia.org/wiki/XMLStarlet

https://xmlstar.sourceforge.net/overview.php

Usage: xml <command> [<cmd-options>]

where <command> is usually:

   sel   (or select)    - Select data or query XML document(s) (XPATH, etc)

   ed    (or edit)      - Edit/Update XML document(s)

   el    (or elements)  - Display element structure of XML document

Questions

BAD xml.exe el -u --html https://www.acme.com/index.html

BAD wget -qO - https://www.acme.com/index.html | xml.exe el -u --html

Xidel

Learn

Example

xidel input.gpx -se "//trk/trkseg/*" --color=never --printed-node-format xml --output-node-format xml --output-declaration="<?xml version=\"1.0\" encoding=\"utf-8\"?>"

Important: Order matters with parameters

Questions

Reading Notes from "XPath and XPointer", John Simpson, 2002

If quotation marks surround the token, it's assumed to be a string. If no quotation marks adorn the token, an XPath-smart application assumes that the token represents a node name.

As a special case, a node name can also be represented with an asterisk (*). This serves as a wildcard (all nodes, regardless of their name) character. The expression taxcut/* locates all elements that are children of a taxcut element. You cannot, however, use the asterisk in combination with other characters to represent portions of a name. Thus, tax* doesn't locate all elements whose names start with the string "tax"; it's simply illegal as far as XPath is concerned.
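The wildcard rule can be tried out with Python's xml.etree.ElementTree, which supports only a subset of XPath. The sample document is made up for this sketch; only the taxcut name comes from the notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the taxcut element name comes from the notes.
doc = """
<budget>
  <taxcut><amount>100</amount><year>2002</year></taxcut>
  <taxhike><amount>50</amount></taxhike>
</budget>
"""
root = ET.fromstring(doc)

# taxcut/* : every element that is a child of a taxcut element
children = [el.tag for el in root.findall("taxcut/*")]
print(children)

# There is no tax* name wildcard; filter on the tag name in Python instead.
tax_like = [el.tag for el in root if el.tag.startswith("tax")]
print(tax_like)
```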

Delimiters: /, [], = , != , < , > , <= , and >=, ::, // , @ , . , and .., |, (), + , - , * , div , and mod.

If need be, normalize-space() strips leading and trailing whitespace from a string (such as an element's content) and collapses each internal run of whitespace into a single space.

Reading Notes from "XSLT 2.0 and XPath 2.0 Programmer’s Reference" by Michael Kay, 2008

XPath: Pages 521-680,1117-1122

Reading Notes from "Beginning XML, 2nd Edition" By David Hunter, Kurt Cagle, Chris Dix et al., 2003

This is where the extensible in Extensible Markup Language comes from: anyone is free to mark up data in any way using the language, even if others are doing it in completely different ways.

There have already been numerous projects to produce industry-standard vocabularies to describe various types of data. For example, Scalable Vector Graphics (SVG) is an XML vocabulary for describing two-dimensional graphics.

XSLT was created to transform XML documents from one format to another, which can potentially make these kinds of transformations very simple.

What HTML does for display, XML is designed to do for data exchange.

XML also groups information in hierarchies. The items in our documents relate to each other in parent/child and sibling/sibling relationships. These "items" are called elements.

This structure is also called a tree; any parts of the tree that contain children are called branches, while parts that have no children are called leaves.

Because the <name> element has only other elements for children, and not text, it is said to have element content. Conversely, since <first>, <middle>, and <last> have only text as children, they are said to have simple content. Elements can contain both text and other elements. They are then said to have mixed content.

Document type: structured in a specific way, to describe a specific type of information.

DTDs and Schemas provide ways to define our document types.

Namespaces provide a means to distinguish one XML vocabulary from another, which allows us to create richer documents by combining multiple vocabularies into one document type.

XPath describes a querying language for addressing parts of an XML document. This allows applications to ask for a specific piece of an XML document, instead of having to always deal with one large "chunk" of information.

For simpler cases, we can use Cascading Style Sheets (CSS) to define the presentation of our documents. And, for more complex cases, we can use Extensible Stylesheet Language (XSL), that consists of XSLT, which can transform our documents from one type to another, and Formatting Objects, which deal with display.

XLink and XPointer are languages that are used to link XML documents to each other, in a similar manner to HTML hyperlinks.

Two ways for traditional applications to interface with XML documents: document object model (DOM), and Simple API for XML (SAX).

XML is also used as a protocol for Remote Procedure Calls (RPC). A technology called the Simple Object Access Protocol (SOAP) allows this to occur even through a firewall, which would normally block such calls, providing greater opportunities for distributed computing.

The text between the start-tag and end-tag of an element is called the element content.

The root element contains the entire XML document.

An empty element can be written as a self-closing tag, e.g. <parody />.

In addition to tags and elements, XML documents can also include attributes:

<name nickname="Shiny John">
<first>John</first>
<middle>Fitzgerald Johansen</middle>
<last>Doe</last>
</name>

Use attributes for information that is only relevant to a few records; otherwise, use elements.

An XML declaration isn't required, but it's considered good practice to include it:

<?xml version='1.0' encoding='UTF-16' standalone='yes'?>

It's recommended to encode documents in UTF-8 or UTF-16, but other encodings can be used:

<?xml version="1.0" encoding="windows-1252"?>
<?xml version="1.0" encoding="ISO-8859-1"?>

Although it isn't all that common, sometimes you need to embed application-specific instructions into your information, to affect how it will be processed. XML provides a mechanism to allow this, called processing instructions or, more commonly, PIs.

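Ways to use reserved characters: escape them with entity references (for example &lt; for < and &amp; for &), or wrap the text in a CDATA section.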

XML namespaces are needed where two document types have elements with the same name, but with different meanings and semantics.
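A minimal sketch of that situation in Python's xml.etree.ElementTree: two vocabularies that both define a title element, told apart by their namespace. The namespace URIs and prefixes are invented for this example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies that both define a <title> element.
doc = """
<catalog xmlns:bk="http://example.com/book" xmlns:pers="http://example.com/person">
  <bk:title>Beginning XML</bk:title>
  <pers:title>Dr.</pers:title>
</catalog>
"""
root = ET.fromstring(doc)

# Prefix -> URI mapping used only for lookups; prefixes are arbitrary.
ns = {"bk": "http://example.com/book", "pers": "http://example.com/person"}

book_title = root.find("bk:title", ns).text
person_title = root.find("pers:title", ns).text
print(book_title, "/", person_title)

# Internally ElementTree stores the expanded name {uri}local
print(root[0].tag)
```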

Chapter 3 - XML Namespaces

Chapter 4 - XSLT

In order to perform an XSLT transformation, you need at least three things: an XML document to transform, an XSLT stylesheet, and an XSLT engine.

Extensible Stylesheet Language, as the name implies, is an XML-based language used to create stylesheets. An XSL engine uses these stylesheets to transform XML documents into other document types, and to format the output.

Chapter 5 - Document Type Definitions

Chapter 6 - XML Schemas

Chapter 7 - Advanced XML Schemas

Chapter 8: The Document Object Model (DOM)

An XML document is structured very much like an object model: it is hierarchical, with nodes potentially having other nodes as children.

The DOM is usually added as a layer between the XML parser and the application that needs the information in the document, meaning that the parser reads the data from the XML document and then feeds that data into a DOM. The DOM is then used by a higher-level application. The application can do whatever it wants with this information, including putting it into another proprietary object model, if so desired.

Any part of an XML document is a node.

DOM implementations can be specialized to work only with XML documents or only with HTML documents, or they can be built to work with a number of types of documents. Each Node object provides a NodeList, called childNodes, which contains all of that node's children. We can directly access the first node in that list, using the firstChild property.

The text inside an element is not part of the element itself; it actually belongs to a text node, which is a child of the element node. An element doesn't have any values of its own, only children.

Using "<first>John</first>":

//pops up a message box saying "first"
alert(oNode.nodeName);
 
//pops up a message box saying "null": The text inside an element is not part of the element itself; it belongs to
//a text node, which is a child of the element node
alert(oNode.nodeValue);
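The same behaviour can be observed with Python's standard xml.dom.minidom, shown here as a rough equivalent of the script above rather than the book's own code:

```python
from xml.dom import minidom

doc = minidom.parseString("<first>John</first>")
node = doc.documentElement

print(node.nodeName)              # name of the element node
print(node.nodeValue)             # element nodes carry no value of their own
print(node.firstChild.nodeValue)  # the text lives in the child text node
```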

The documentElement property is a special property of the Document interface, which returns the root element node.

Many of the properties and methods in the DOM will return a collection of Nodes, instead of just one, which is why the NodeList and NamedNodeMap interfaces were created.

To get a list of all of the nodes named "name" in an XML file:

var oNodeList;
//returns a NodeList, containing all of the descendant elements of a node that have the specified tag
oNodeList = oDOM.getElementsByTagName("name");
 
//item() is zero-based, so item(1) is the second matching element
alert(oNodeList.item(1).firstChild.nodeValue);

The Node interface has an attributes property, which returns a NamedNodeMap:

var oMap;
oMap = oDOM.documentElement.attributes;
 
alert(oMap.getNamedItem("first").nodeValue);

When creating an XML document from scratch, or even adding nodes to an existing document, most of the work is done through the Document interface. This interface provides factory methods that can be used to create other objects, for example createElement() or createAttribute(). Once a new node has been created, it must be appended to the document. The Node interface provides the appendChild() and insertBefore() methods to do this:

var oNode, oText;
 
oNode = oDOM.createElement("root");
oText = oDOM.createTextNode("root PCDATA");
 
oDOM.appendChild(oNode);
oNode.appendChild(oText);
 
var oAttr;
oAttr = oDOM.createAttribute("id");
oAttr.nodeValue = "123";
oNode.attributes.setNamedItem(oAttr);
alert(oDOM.xml);
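A comparable sketch with Python's xml.dom.minidom, using the same factory methods on the Document. setAttribute() is used here as a shortcut instead of creating the attribute node by hand; that is a choice made for this example.

```python
from xml.dom import minidom

doc = minidom.Document()

root = doc.createElement("root")          # factory method on the Document
text = doc.createTextNode("root PCDATA")

doc.appendChild(root)                     # attach the element to the document
root.appendChild(text)                    # attach the text node to the element
root.setAttribute("id", "123")            # shortcut for adding an attribute node

serialized = root.toxml()
print(serialized)
```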

Here's how to remove a node:

var oNode, oRemovedNode;
 
oNode = oDOM.documentElement.firstChild;
oRemovedNode = oNode.removeChild(oNode.firstChild);
 
alert(oRemovedNode.nodeValue);
alert(oDOM.xml);

The DOM defines two interfaces to work with text: CharacterData and Text.

Because CharacterData extends Node, both CharacterData objects and Text objects are also Node objects.

alert(oText.length);
alert(oText.data);
alert(oText.substringData(12, 4));
oText.appendData(".");
oText.insertData(12, "groovy ");
oText.deleteData(12, 7);
oText.replaceData(8, 8, "a");
oText.splitText(12);

In order to get at the PCDATA in the <DemoElement> element, we had to write code like this:

alert(oDOM.firstChild.firstChild.firstChild.nodeValue);

It would be a lot easier if we could give the DOM an XPath expression, containing the node or nodes that we want, and have it give us back the relevant data. For example, we could get at the same PCDATA using selectSingleNode(), as follows:

//select the text node itself: the nodeValue of an element node is null
alert(oDOM.selectSingleNode("/root/DemoElement/text()").nodeValue);

Using selectNodes() we could filter that, to only return <name> elements that have a first attribute with a value of "John".

oDOM.selectNodes("//name[@first='John']")

getElementsByTagName(), as its name implies, can only return elements, whereas selectSingleNode() and selectNodes() can return any node types. We could use selectNodes() to get back a NodeList containing all of the first attributes in a document like so:

oDOM.selectNodes("//@first")
 
alert(oDOM.selectSingleNode("/people/managers/name[1]").firstChild.nodeValue);
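The selections above can be approximated in Python with xml.etree.ElementTree, which implements only a subset of XPath and cannot return attribute nodes. The sample document is invented to match the paths used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the paths used above.
doc = """
<people>
  <managers>
    <name first="John">John Doe</name>
    <name first="Jane">Jane Roe</name>
  </managers>
</people>
"""
root = ET.fromstring(doc)   # root is the <people> element

# //name[@first='John'] : name elements anywhere with first="John"
johns = root.findall(".//name[@first='John']")

# //@first : no attribute nodes in ElementTree, so collect the values instead
firsts = [el.get("first") for el in root.iter() if "first" in el.attrib]

# /people/managers/name[1] : positions are 1-based; path is relative to <people>
first_manager = root.find("managers/name[1]").text

print([el.text for el in johns], firsts, first_manager)
```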

Because the DOM is creating all of these objects in memory, one for each and every node in the XML document, DOM implementations can be quite large, and processing XML documents via the DOM can take up a lot of memory. If the DOM is too slow, or takes up too much memory, we can use the Simple API for XML (SAX) instead.

Unlike DOM, SAX is event-driven: Rather than parse the document into the DOM and then use the DOM to navigate around the document, we tell the parser to raise events whenever it finds something. This is done through callback methods.
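As an illustration of the callback style, here is a small sketch with Python's standard xml.sax module. The class name EventLogger and the sample document are made up for this example; the three callbacks are part of the standard ContentHandler interface.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records element names and character data as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        if content.strip():               # skip whitespace-only chunks
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventLogger()
xml.sax.parseString(b"<name><first>John</first><last>Doe</last></name>", handler)
print(handler.events)
```

Note that the parser may deliver the text of one element in several characters() calls, so real handlers usually accumulate the chunks until endElement().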

The most important methods in SAX's ContentHandler are as follows:

Here are a few drawbacks of SAX:

Chapter 10: SOAP

Chapter 11: Displaying XML

Chapter 12 - XML and Databases

Chapter 13 - Linking and Querying XML

XLink and XPointer are ways to link XML documents together. XQuery is a new query language for XML.

Reading Notes from "Learning XML, 2nd Edition" By Erik T. Ray, O'Reilly, September 2003

XML's markup divides a document into separate information containers called elements.

If XML markup is a structural skeleton for a document, then tags are the bones. They mark the boundaries of elements, allow insertion of comments and special instructions, and declare settings for the parsing environment. A parser, the front line of any program that processes XML, relies on tags to help it break down documents into discrete XML objects. Inside element start tags, you sometimes will see some extra characters next to the element name in the form of name="value". These are attributes. They associate information with an element that may be inappropriate to include as character data.

An XML document has two parts. First is the document prolog, a special section containing metadata. The second is an element called the document element, also called the root element for reasons you will understand when we talk about trees. The root element contains all the other elements and content in the document. The prolog is optional. If you leave it out, the parser will fall back on its default settings.

The markup symbols are delineated by angle brackets (<>). <to> and </villain> are two such symbols, called tags. The data, or content, fills the space between these tags.

Document type definition (DTD). <!DOCTYPE...> is one example of a type of markup called a declaration. Declarations are used to constrain grammar and declare pieces of text or resources to be included in the document. This line isn't required unless you want a parser to validate your document's structure against a set of rules you provide in the DTD.

The document element is also sometimes called the root element.

The empty tag <graphic.../>, which represents an empty element. Rather than containing data, this element references some other information that should be used in its place, in this case a graphic to be displayed. Empty elements do not mark boundaries around text and other elements the way container elements do, but they still may convey positional information. Every element that contains data has to have both a start tag and an end tag or the empty form used for graphic. (It's okay to use a start tag immediately followed by an end tag for an empty element; the empty tag is effectively an abbreviation of that.)

Strictly speaking, XML is not a markup language. A language has a fixed vocabulary and grammar, but XML doesn't actually define any elements. Instead, it lays down a foundation of syntactic constraints on which you can build your own language. So a more apt description might be to call XML a markup language toolkit.

Because XML doesn't have a predetermined vocabulary, it's possible to invent a markup language as you go along. Documents that follow the syntax rules of XML are well-formed XML documents. A document model is the blueprint for an instance of a markup language. It gives you an even stricter test than well-formedness. When a document instance matches a document model, we say that it is valid.

There are several ways to define a markup language formally. The two most common are document type definitions (DTDs) and schemas. Schemas are a later invention, offering more flexibility and a way to specify patterns for data, which is absent from DTDs.
One limitation of DTDs is that they don't do much checking of text content. An alternative document modeling scheme provides the solution. XML Schemas provide much more detailed control over a document, including the ability to compare text with a pattern you define.

The XPath language provides a convenient method to specify which nodes to return in a tree context. A parser written as a hybrid will only need to return a list of nodes that match an XPath expression. A stream parser efficiently searches through the document to find the nodes, then passes the locations to a tree builder that assembles them into object trees. XPath's advantage is that it has a very rich language for specifying nodes, giving the developer a lot of control and flexibility.

The two most popular stylesheet languages are Cascading Style Sheets (CSS) and the Extensible Stylesheet Language (XSL). The former is very simple and fine for most online documents. The latter is highly detailed and better for print-quality documents.

Extensible Stylesheet Language Transformations (XSLT) can automate the task of converting between one format and another in a process called transformation. XSLT is essentially a programming language optimized for transforming XML. It requires a transformation instruction, which happens to be called a stylesheet (not to be confused with a CSS stylesheet). An XSLT processor is a program that takes an XML document and an XSLT stylesheet as input and outputs a transformed document.

Most programming languages have support for parsing and navigating XML. They frequently make use of two standard interfaces. The Simple API for XML (SAX) is very popular for its simplicity and efficiency. The Document Object Model (DOM) outlines an interface for moving around an object tree of a document for more complex processing.

PyXML supports DTD validation, SAX2, DOM2, PullDOM.

To return to the ideals of generic coding, some people tried to adapt SGML for the Web—or rather, to adapt the Web to SGML. This proved too difficult. SGML was too big to squeeze into a little web browser. A smaller language that still retained the generality of SGML was required, and thus was born the Extensible Markup Language (XML).

Parsing:

DOM and SAX are often too complex for a simple query like this. XPath is a shorthand for locating a point inside an XML document. It is used in XPointers and also in places like XSLT and some DOM implementations to provide a quick way to move around a document.

STOPPED 2.4 Elements

XPath: Each step in a path touches a branching or terminal point in the tree called a node. In keeping with the arboreal terminology, a terminal node (one with no descendants) is sometimes called a leaf. In XPath, there are seven different kinds of nodes: root, element, attribute, text, comment, processing instruction, and namespace.

XPath uses chains of steps. The terms "child" and "parent" are still applicable. A location path is a chain of location steps that get you from one point in a document to another. If the path begins with an absolute position (say, the root node), then we call it an absolute path. Otherwise, it is called a relative path because it starts from a place not yet determined. A location step has three parts: an axis that describes the direction to travel, a node test that specifies what kinds of nodes are applicable, and a set of optional predicates that use Boolean (true/false) tests to winnow down the candidates even further.

XPath expressions are statements that can extract useful information from the tree. Instead of just finding nodes, you can count them, add up numeric values, compare strings, and more. They are much like statements in a functional programming language.

XML Pointer Language (XPointer) uses XPath expressions to find points inside external parsed entities, as an extension to uniform resource identifiers (URIs). It could be used, for example, to create a link from one document to an element inside any other.

XSL is really three technologies rolled into one: XPath, XSLT, and XSL Formatting Objects (XSL-FO).

XSL Transformations (XSLT): An XSLT processor (I'll call it an XSLT engine) takes two things as input: an XSLT stylesheet to govern the transformation process and an input document called the source tree. The output is called the result tree.

The two main methods of working with XML files with computer languages are event streams (SAX) and object trees (DOM).

The stream approach treats XML content as a pipeline. As it rushes past, you have one chance to work with it, no look-ahead or look-behind. It is fast and efficient, allowing you to work with enormous files in a short time, but depends on simple markup that closely follows the order of processing. An XML stream emits a series of tokens or events, signals that denote changes in markup status. For example, an element has at least three events associated with it: the start tag, the content, and the end tag.
The XML stream is constructed as it is read, so events happen in lexical order. The content of an element will always come after the start tag, and the end tag will follow that. Somewhere between chopping up a stream into tokens and processing the tokens is a layer one might call an event dispatcher. It branches the processing depending on the type of token. The code that deals with a particular token type is called an event handler. There could be a handler for start tags, another for character data, and so on. A common technique is to create a function or subroutine for each event type and register it with the parser as a call-back, something that gets called when a given event occurs.

SAX implements what we call push parsing. The parser pushes events at the program, requiring it to react. The parser doesn't store any state information, contextual clues that would help in decisions for how to parse, so the application has to store this information itself.
Pull parsing (eg. XMLPULL) is just the opposite. The program takes control and tells the parser when to fetch the next item. Instead of reacting to events, it proactively seeks out events. This allows the developer more freedom in designing data handlers, and greater ability to catch invalid markup.
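Python's standard library offers a pull-style interface in xml.etree.ElementTree.XMLPullParser. It is not the XMLPULL API mentioned above, but it shows the same idea: the program feeds data and asks for events when it is ready. The input chunks are made up for this sketch.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("start", "end"))

seen = []
# The program decides when to feed data and when to ask for events.
for chunk in ("<name><first>John</first>", "<last>Doe</last></name>"):
    parser.feed(chunk)
    for event, elem in parser.read_events():
        seen.append((event, elem.tag))

parser.close()
# Events still pending after close() can be read the same way.
for event, elem in parser.read_events():
    seen.append((event, elem.tag))

print(seen)
```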

The workhorse of SAX is the SAX driver. A SAX driver is any program that implements the SAX2 XMLReader interface. It may include a parser that reads XML directly, or it may just be a wrapper for another parser to adapt it to the interface. It may even be a converter, transmuting data of one kind (say, SQL queries) into XML.

Streams fail in situations where the data is so complex that it requires a lot of searching around. For example, XSLT jumps from element to element in an order that may not match the lexical order at all. When that is the case, we prefer to use the tree model.

The tree method is luxurious in comparison to streams. This structure requires more resources to build and store, so you will only want to use it when the stream method cannot help. This persistence is the key reason for using trees. Since a tree is acyclic (it has no circular links), you can use simple traversal methods that won't get stuck in infinite loops. Like a filesystem directory tree, you can represent the location of a node easily in simple shorthand. Like real trees, you can break a piece off and treat it like a smaller tree. Most important, you have all the information in one place for as long as you need it. With streams, you are forced to work with events as they arrive, perhaps storing bits of data for later use. Tree processing is usually object-oriented. The data structure representing the document is composed of objects whose methods allow you to traverse in different directions, pull out data, or modify values.

While SAX defines an interface of handler methods, the DOM specification calls for a number of classes, each with an interface of methods that affect a particular type of XML markup. Thus, every object instance manages a portion of the document tree, providing accessor methods to add, remove, or modify nodes and data. These objects are typically created by a factory object, making it a little easier for programmers who only have to initialize the factory object themselves. In DOM, every piece of XML (the element, text, comment, etc.) is a node represented by a node object.

If streams and trees are the two extremes on a spectrum of XML processing techniques, then the middle ground is home to solutions we might call hybrids. They combine the best of both worlds, low resource overhead of streams with the convenience of a tree structure, by switching between the two modes as necessary. The idea is, if you are only interested in working with a small slice of a document and can safely ignore the rest, then you only need to work with a subtree. The parser scans through the stream until it sees the part that you want, then switches to tree building mode.
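One way to sketch this hybrid approach in Python is xml.etree.ElementTree.iterparse, which streams through the input but hands back each completed element as a small tree. The feed document and the item/title names are assumptions for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical feed: we only care about the <item> subtrees.
data = io.BytesIO(
    b"<feed><meta>ignored</meta>"
    b"<item><title>A</title></item>"
    b"<item><title>B</title></item></feed>"
)

titles = []
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "item":
        # At this point the <item> subtree is fully built.
        titles.append(elem.findtext("title"))
        elem.clear()          # drop its children to keep memory use low

print(titles)
```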

Data binding: Some developers don't need direct access to XML document structures—they just want to work with objects or other data structures. Data binding approaches minimize the amount of interaction between the developer and the XML itself. Instead of creating XML directly, an API takes an object and serializes it. Instead of reading an XML document and interpreting its parts, an API takes an XML document and presents it as an object. Data binding processing tends to focus on schemas, which are used as the foundation for describing the XML representing a particular object.
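A hand-written sketch of the data binding idea in Python, using a dataclass and xml.etree.ElementTree. Real data binding tools typically generate such classes from a schema; the Name class and its from_xml/to_xml methods are made up for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

FIELDS = ("first", "middle", "last")

@dataclass
class Name:
    first: str
    middle: str
    last: str

    @classmethod
    def from_xml(cls, text: str) -> "Name":
        """Build a Name object from a <name> element."""
        root = ET.fromstring(text)
        return cls(*(root.findtext(f, default="") for f in FIELDS))

    def to_xml(self) -> str:
        """Serialize the object back to a <name> element."""
        root = ET.Element("name")
        for f in FIELDS:
            ET.SubElement(root, f).text = getattr(self, f)
        return ET.tostring(root, encoding="unicode")

person = Name.from_xml(
    "<name><first>John</first><middle>Fitzgerald Johansen</middle><last>Doe</last></name>"
)
print(person)
print(person.to_xml())
```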

Temp

"SAX is a top-down parser and allows serial access to a XML document, and works well for read only access. DOM on the other hand is more robust - it reads the entire XML document into a tree, and is very efficient when you want to alter, add, remove data in that XML tree. XPath is useful when you only need a couple of values from the XML document, and you know where to find them (you know the path of the data, /root/item/challange/text)." (Source)

"XPath is used to retrieve and interpret information represented in XML files using either a DOM or SAX parser." (Source)

"XPATH is a different animal altogether. Its just a technique for querying documents. For example, I can ask it to pull out just one node at a time without worrying about the rest of the document. From that point you have to decide what parser you are going to use to process the data, SAX or DOM.
And just for good measure, SimpleXML is very similar to DOM, but makes it a whole lot easier to use as you can traverse it using standard iteration techniques like you would with arrays, objects and iterators" (Source)

http://www.powerbasic.com/support/forums/Forum7/HTML/002018.html

http://www.gipsysoft.com/qhtm/doc/

http://www.xml-rpc.net/faq/xmlrpcnetfaq.html

Resources