Introduction to XML

This article was machine translated and it's kind of unreadable right now. I'll be cleaning the translation soon. Please come back!

The information age wants its own language!

The problem

Moving information from one place to another has never been an easy task in computing. Information is usually strongly bound to the program where it was created, which is why people often waste lots of time converting from Word to Excel to Quattro Pro to HTML pages to whatever. On top of that, the information is also strongly bound to how its creator wanted it to look. Many of you have surely had to endure the arduous task of fixing a Word document written by a neophyte, one who used spaces instead of tabs or pressed Enter at the end of each line =).

It would be better to have some mechanism for keeping just the pure information. Would plain text do? No, because information has its own internal structure that is important to preserve, even more so if we must handle large collections of many similar documents. One should be able to easily extract, say, all the titles of the documents in such a collection.

A simple (and perhaps overused) example: a collection of 750 recipes. Write them all in Word? How dangerous! What would happen if I'm told they are needed in HTML? Or that they must be printed in a book using specific programs and a certain font style for the ingredients? So it's advisable to pay some more attention when choosing the format. Plain text? Nah... it is evident that a recipe is divided into well-identified parts, so using plain text means losing information, and we are here to try to keep as much of it as possible.

The solution

XML tells us that we can structure information as a tree. That is, think of the recipe as a component which is in turn made of components, and so on. Each component can contain text and/or other components. Got it? One possible structure is to imagine that the recipe has a component called we-need. Not all of the text would go inside we-need, only those things that whoever is going to prepare the recipe needs in order to carry it out successfully. Inside it we could have one or more components called ingredient. Let's see what this looks like (already using XML syntax).

 
<recipe>
...
	<we-need>
		<ingredient>2 spoons of sugar</ingredient>
		<ingredient>3 apples</ingredient>
	</we-need>
...
</recipe>

Isn't it easy to guess what XML syntax is like? You simply enclose the text belonging to a component between <component> and </component>. OK, in fact nobody calls them components. People usually call these beasts elements, and the marks that delimit where they begin and end are called tags.
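To make the tree idea a bit more concrete, here is a minimal sketch in Python that reads the recipe fragment above and pulls out its ingredients. It uses the standard xml.etree.ElementTree module; the names RECIPE_XML and list_ingredients are mine, invented just for this illustration, and the "..." parts of the recipe are simply left out.

```python
import xml.etree.ElementTree as ET

# The recipe fragment from the text, with the "..." parts omitted.
# RECIPE_XML and list_ingredients are hypothetical names for this sketch.
RECIPE_XML = """
<recipe>
    <we-need>
        <ingredient>2 spoons of sugar</ingredient>
        <ingredient>3 apples</ingredient>
    </we-need>
</recipe>
"""


def list_ingredients(xml_text):
    """Return the text of every <ingredient> element, in document order."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so it does not matter how deeply
    # the <ingredient> elements are nested inside <recipe>.
    return [element.text for element in root.iter("ingredient")]


if __name__ == "__main__":
    for ingredient in list_ingredients(RECIPE_XML):
        print(ingredient)
```

The program never looks at line breaks or indentation: it asks for elements by name, which is exactly the information that plain text would have lost.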

On screen

The circle is completed by a style sheet, which is a description of how a piece of information must be shown in a certain medium. Different style sheets can be applied to the same XML document as needed, e.g. one style sheet for each medium in which the information should be presented: one to print a book, another for a WWW page and another for a program that reads the information aloud. The first two will say something like this:

However, a style sheet meant to format the information for reading aloud will have something like

The volume could also be controlled, or even which side the voice comes from.

Style sheets can also control the conversion of this "pure information" into simpler formats, like RTF (to use with Word), HTML, etc. This way, when I already have 750 recipes typed and online on the WWW thanks to a well-suited style sheet, and somebody tells me "Bring me all the stuff in XXX format tomorrow morning so that we can publish a book", I will be able to proudly respond: "Yes, of course!".

Two style sheet languages exist at the moment, and I'll describe them briefly next.

CSS

One of them is the very well-known CSS (Cascading Style Sheets), which is already partially implemented in the most important current WWW browsers. This language simply lets you define how each tag should be displayed. You can only provide simple indications like... this one in red, and this one in blue and big.

A page description in CSS looks similar to this:


ingredient {
	font-family: sans-serif;
	color: red;
}

XSL

The W3 Consortium created a new style sheet language called XSL (eXtensible Stylesheet Language) that, in addition to what CSS offers, can work as a transformation language, which provides the functionality described above.

XSL is in fact two very different standards:

XSLT

It's a language that describes how an XML file must be transformed into another XML file. For example, a file containing tags that describe a recipe can be turned by this system into another file containing tags that only describe the position of the text on a printed page. Obviously, some information will be lost in such a transformation. The idea is that by applying different XSLT files you can get different XML files, each suitable for a different presentation or use of the information.
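To give a feeling for what such a transformation does, here is a rough sketch. Be warned that this is not XSLT: it is a hand-written Python function (recipe_to_html, a name I made up) that does by hand what a very small XSLT file would describe, turning the recipe tags into a presentational HTML list. It only assumes the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical input for this sketch: the recipe fragment from the text.
RECIPE_XML = (
    "<recipe><we-need>"
    "<ingredient>2 spoons of sugar</ingredient>"
    "<ingredient>3 apples</ingredient>"
    "</we-need></recipe>"
)


def recipe_to_html(xml_text):
    """Build an HTML <ul> with one <li> per <ingredient> (not real XSLT)."""
    source = ET.fromstring(xml_text)
    ul = ET.Element("ul")
    for ingredient in source.iter("ingredient"):
        # Each semantic <ingredient> becomes a purely presentational <li>.
        li = ET.SubElement(ul, "li")
        li.text = ingredient.text
    # encoding="unicode" makes tostring() return a str without an XML header.
    return ET.tostring(ul, encoding="unicode")


if __name__ == "__main__":
    print(recipe_to_html(RECIPE_XML))
```

Note how the result no longer says that these items are ingredients or that they are things we need: that is the information that gets lost along the way.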

The 1.0 version of this part of the standard was published in 1999, and there are several implementations of varying quality.

XSL FO

FO stands for formatting objects. An XSL FO file is a kind of very dirty HTML, full of tags for colors, sizes and positions. It is a file that preserves nothing of the semantics of the original information; it only describes how it must look on screen or on paper. It is similar in concept to the PostScript language, or perhaps to TeX.

This format, which is obviously an XML format, can be generated by means of an XSLT transformation. Following the example I used before, we can describe a fairly complete circuit of what happens to the information:

XML with recipes + XSLT transformation → "formatting objects"

This part of the standard is a work in progress, and no browser supports it yet or is anywhere near supporting it. Time will tell whether it becomes an industry standard or not. What is indeed starting to be used in the meantime as a presentational language is XML+CSS. That is to say, generating dirty HTML, which is nothing but a presentational language, from clean, pretty, good XML... is not so bad.

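Other related technologies are DOM (Document Object Model) and SAX (Simple API for XML), which are standard APIs for accessing the "trees" of XML information. DOM hands you the whole tree in memory, while SAX reports it piece by piece as the file is read.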
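As a small taste of SAX, here is a sketch using Python's standard xml.sax module. The parser calls a handler every time an element starts, so nothing has to be kept in memory. The class IngredientCounter and the function count_ingredients are names I invented for this example.

```python
import xml.sax


class IngredientCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts <ingredient> elements as they stream by."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser once for every opening tag it reads.
        if name == "ingredient":
            self.count += 1


def count_ingredients(xml_text):
    """Parse xml_text with SAX and return how many ingredients it contains."""
    handler = IngredientCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.count


if __name__ == "__main__":
    sample = (
        "<recipe><we-need>"
        "<ingredient>2 spoons of sugar</ingredient>"
        "<ingredient>3 apples</ingredient>"
        "</we-need></recipe>"
    )
    print(count_ingredients(sample))
```

This style is handy for very large files, at the price of having to keep track yourself of where in the document you are.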

Stirring the data

From now on I'll be speaking to the programmers, be warned.

We will also see that the screen is not the only possible destination of an XML document. These little files are very easy to use as input for programs. The recipes from the previous section could, in some other century, activate a plugin connected to a robot, a replicator or whatever, which would be told to prepare them automatically =). Perhaps this sounds too crazy, but in the case of MathML, which is meant to distribute mathematical equations along with their semantics, it makes a lot of sense to move the mouse over an equation and tell the browser from the right-button menu: solve x for me, please.

We have all worked once or twice with CSV files (fields separated by commas or tabs) produced by a spreadsheet. CSV files can be very useful for taking information from one place to another. OK, XML is a kind of super-CSV, since it serves the same purpose while solving several important issues. The differences with CSV: it is flexible, since an element can be present or not and new elements can be added without breaking the program that interprets the file; and the fields are self-describing, so much so that they can even come in any order (<one/><two/> is the same as <two/><one/> if the interpreting program isn't interested in the order).
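The flexibility described above can be sketched in a few lines of Python. The person vocabulary (name, email, phone) and the function read_person are assumptions of mine, made up only to show a reader that asks for fields by name, so order, extra elements and missing elements do not break it.

```python
import xml.etree.ElementTree as ET


def read_person(xml_text):
    """Pick the fields this program cares about, wherever they appear."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        # findtext() returns None when the element is absent.
        "email": root.findtext("email"),
    }


# Three hypothetical documents: the original layout, a reordered one with
# an extra <phone> element, and one where <email> is missing.
ORIGINAL = "<person><name>Ana</name><email>ana@example.org</email></person>"
REORDERED = (
    "<person><email>ana@example.org</email>"
    "<phone>555-0100</phone><name>Ana</name></person>"
)
WITHOUT_EMAIL = "<person><name>Ana</name></person>"

if __name__ == "__main__":
    for document in (ORIGINAL, REORDERED, WITHOUT_EMAIL):
        print(read_person(document))
```

With a CSV file, swapping two columns or inserting a new one in the middle would silently feed the wrong values to a program that reads fields by position.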

What is the distinguishing quality of structured information? That it has structure, is the obvious answer. And that structure is what one wants to manipulate from a program. We already saw that an XML file loaded in memory is modeled as a tree. So an API that lets us work with the data of an XML file will be very similar to an API for handling trees. Such an API will have the obvious functions of the kind "add child", "give me the parent", etc. That API is DOM. More information on DOM can be found on the W3C's DOM page.
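Here is roughly what "add child" and "give me the parent" look like in practice, sketched with xml.dom.minidom, the small DOM implementation in Python's standard library. The helper add_ingredient and the sample document are my own inventions for this example; appendChild and parentNode are the actual DOM names.

```python
from xml.dom import minidom

# Hypothetical starting document, based on the recipe used in this article.
RECIPE_XML = "<recipe><we-need><ingredient>3 apples</ingredient></we-need></recipe>"


def add_ingredient(doc, text):
    """Append a new <ingredient> to the first <we-need> and return it."""
    we_need = doc.getElementsByTagName("we-need")[0]
    ingredient = doc.createElement("ingredient")
    ingredient.appendChild(doc.createTextNode(text))
    # "add child"
    we_need.appendChild(ingredient)
    return ingredient


if __name__ == "__main__":
    doc = minidom.parseString(RECIPE_XML)
    new_node = add_ingredient(doc, "2 spoons of sugar")
    # "give me the parent"
    print(new_node.parentNode.tagName)
    print(doc.documentElement.toxml())
```

The same calls exist, with the same names, in the DOM bindings of other languages, which is the whole point of the API.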

The designers of the DOM API took on the mission of designing an API not for a single language, but for all of them. It is available in Java, C/C++, Perl, etc. With such a simple API already available in all these languages... why would I, the programmer, develop a component to analyze a text file (a parser)? Don't write any more code to analyze formats yourselves! Jump on the XML bandwagon, which in addition makes your application look more modern! (This on top of the detail that having less code of your own means fewer bugs, but that's just a detail; what matters is impressing others.)

XML and communications

XML gets mentioned a lot nowadays whenever several parties want to talk to each other. XML gives them a way to define a vocabulary and to define interfaces among themselves that can evolve gracefully over time. For example, an Internet shopping site talking to a credit card company.

There are three easy ways to use XML for communication between processes:

More information


Also visit my GNU/Linux page.


Nicolás Lichtmaier loves to receive encouraging messages that say "I have never seen a better page in all my life". OK, critical comments too.
