Selecting the Right Text Format LG #98

By Dean Wilson

I have been working on a project that involves storing textual data so that it can be easily searched and presented in various formats. The information you see here was written to help me explain to others what choices were available, along with their advantages and disadvantages.

Binary Formats

These are formats where you need additional information in order to make any sense whatsoever of the data. The most popular example is the .doc format used by Microsoft Word. If you look at a Word document in a text editor you will not see your text. This problem alone makes this format useless for a system which must search and present data.

While there is no reason to dwell on this format, I would like to explain one additional concern which applies to many situations other than the one I am working on. It is quite common for the vendor to change the format over time, claiming that they need to do this to enhance the capabilities of their programs. This may or may not be true, but it can easily mean that some future version of their software will not use the same format, either making your old data useless or making it impossible to create new data in the format you had originally used.

Because of the unsuitability of these types of formats for the system I am working on, I will not go into any further detail. The following formats are all text-based.

Presentation Oriented

These languages allow you to see your original textual information within the document. The information on how the text is to be presented is embedded within the text itself. Examples of this type of markup are troff, TeX, Rich Text Format (RTF) and PostScript. Of these four examples, troff and TeX are by far the oldest and also by far the easiest to extract text from.

Troff was initially written to produce typeset output on a particular phototypesetter. There is a related program, nroff, which was designed to take the same basic document format and produce output on a regular printer. TeX (and LaTeX) were designed to typeset complicated documents containing special mathematical symbols not available in standard ASCII text. It is relatively easy to extract the input text from basic text documents in these formats. As the document content becomes more complicated, extraction also becomes more complicated.

Early on in the Word Processor Wars, Microsoft created a new standard (that is, they called it a standard) for document interchange called RTF. Unlike troff or TeX, the user wasn't supposed to create documents in RTF. It was just to be used as an interchange medium between different word processors.

Finally, PostScript was written to be a descriptive language for pages that was independent of output devices. For example, the same PostScript document could be printed on a relatively inexpensive laser printer with 300 dots per inch resolution or on a phototypesetter with 3000 or more dots per inch resolution.

In all these languages, the emphasis is on describing what you want the document to look like. You describe font sizes, type styles and positioning within the page. In order to search the original text of the document you must strip out all this formatting information.
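To give a feel for what "stripping out the formatting" involves, here is a minimal Python sketch for a very simple troff document. It assumes that every line starting with a period or an apostrophe is a formatting request and that the only inline escapes are font changes such as \fB and \fP. The function name strip_troff and the sample text are made up for illustration. Real documents use many more escapes and put text inside macro arguments, which is exactly where extraction gets harder.

```python
import re

def strip_troff(source):
    """Return the plain text of a very simple troff document.

    Assumption: request lines start with '.' or "'", and the only
    inline escapes are font changes like \\fB, \\fI, \\fR and \\fP.
    """
    kept = []
    for line in source.splitlines():
        # Skip requests and macro calls such as .PP or .sp
        if line.startswith((".", "'")):
            continue
        # Remove font-change escapes, leaving the words themselves
        line = re.sub(r"\\f[A-Za-z0-9]", "", line)
        kept.append(line)
    return "\n".join(kept)

# Hypothetical sample document
sample = ".PP\nThis is \\fBbold\\fP text.\n.sp\nSecond line."
print(strip_troff(sample))
```

Only after a pass like this can the text be searched reliably, since a search for "bold text" would otherwise miss the words separated by an escape.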

Fixed Markup

As you can see from the previous descriptions, none of these formats make the original information available for easy searching. Also, you need some sort of conversion program to translate these document formats into the various presentation formats.

Long before Microsoft created the "RTF standard", SGML was around. SGML is a generalized source document markup language that is designed to specify what is in a document rather than how it is to be displayed or printed. SGML, however, is so general that it is complicated for the user to work with and demanding for a slow computer to process.

HTML is close to a dialect of SGML. Close because HTML does not obey all basic SGML conventions and a dialect because HTML defines specific document markup which can be used. For example, the <p> markup is used to identify the beginning of a paragraph. There are two problems here with regard to what I need to do:

While the closing </p> tag is now allowed in HTML, it is not required. This makes processing harder.
While I know where a paragraph starts, I don't know what that paragraph might represent.

Also, HTML has evolved to such an extent that there are a large number of tags. Many of these tags have to do with presentation rather than offering any information about the actual use of the document content. Examples here are the strong, bold and italic tags.

Variable Markup

Let me list what I have learned so far.

Binary formats won't work for searching.
It is hard to extract the original document information from languages designed to describe presentation.
Having a markup that describes the function of a datum rather than what it is to look like is highly desirable.
Having a well-defined syntax makes parsing easier.

It would be relatively easy to write such a markup. In fact, I have done this many times for specific projects. I remember one very basic system where I used a single letter followed by a colon to identify the type of data a record contained. Records were separated by a newline character. Even if we knew all possible record types, there would still be a very significant limitation in this sort of implementation: you cannot describe relationships within the data.
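A reader for that kind of record format might look something like the following Python sketch. The record type letters (T, A, P) and the sample data are invented for illustration and are not the ones used in the original system. Notice that the result is a flat list: nothing in the format says which record belongs to which.

```python
def parse_records(text):
    """Parse newline-separated records of the form 'X:value'.

    Returns a flat list of (type_letter, value) pairs.
    """
    records = []
    for line in text.split("\n"):
        if not line:
            continue  # ignore blank lines
        tag, sep, value = line.partition(":")
        if sep != ":" or len(tag) != 1:
            raise ValueError(f"malformed record: {line!r}")
        records.append((tag, value.strip()))
    return records

# Hypothetical record types: T = title, A = author, P = paragraph
sample = "T:An Example Title\nA:An Example Author\nP:First paragraph."
for tag, value in parse_records(sample):
    print(tag, value)
```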

A rather obvious example is an address. If you impose a structure on it that is applicable to an address in the United States you might end up with something like this:

  Name|Addr1|Addr2|City|State|Zip

Thinking in more global terms, you could add country to the end of the data. Unfortunately, you would then discover that in Spain, for example, postal codes go before the City.

While you could then write code to look in the seventh field (country) for Spain and modify how the information is printed, you would quickly discover there were many other exceptions. With the relatively low cost of data storage today, a better approach would be to add more information about the information within the data record. If this information described what the data was rather than how to process it, and it was added in a well-structured way, it would be very easy to work with.
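As a rough Python sketch of that special-casing, assume the six fields shown earlier plus a seventh country field. The function names and the sample addresses are hypothetical. Every new country with its own conventions would mean another branch in format_address.

```python
# Field order is fixed by position; country is the seventh field
FIELDS = ["name", "addr1", "addr2", "city", "state", "zip", "country"]

def parse_address(record):
    """Split a pipe-delimited record into a dict keyed by field name."""
    values = record.split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

def format_address(addr):
    """Format an address, special-casing countries one at a time."""
    lines = [addr["name"], addr["addr1"]]
    if addr["addr2"]:
        lines.append(addr["addr2"])
    if addr["country"] == "Spain":
        # Exception: the postal code goes before the city
        lines.append(f'{addr["zip"]} {addr["city"]}')
    else:
        # US-style layout is the default
        lines.append(f'{addr["city"]}, {addr["state"]} {addr["zip"]}')
    lines.append(addr["country"])
    return "\n".join(lines)

# Hypothetical sample records
us = parse_address("Jo Example|1 Main St||Springfield|IL|62701|USA")
es = parse_address("Ana Example|Calle Falsa 1||Madrid||28001|Spain")
print(format_address(us))
print(format_address(es))
```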

Enter XML, which stands for Extensible Markup Language. XML is designed to do precisely this job. Looking back at the address model, you could present the address in XML like this:

  <address>
    <name>Name</name>
    <addr1>Addr1</addr1>
    <addr2>Addr2</addr2>
    <city>City</city>
    <state>State</state>
    <zip>Zip</zip>
  </address>

There is nothing special about the spacing and indentation. This is just to make it clear to the reader what I am doing. The only thing that is important is that the address information starts with <address> and ends with </address>.
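To illustrate the point about spacing, this small Python sketch reads the same address twice, once indented and once on a single line, using the standard library's xml.etree.ElementTree module. The helper name address_fields is made up for this example. Whitespace between elements is ignored here, although whitespace inside an element's text would be kept.

```python
import xml.etree.ElementTree as ET

INDENTED = """
<address>
  <name>Name</name>
  <addr1>Addr1</addr1>
  <addr2>Addr2</addr2>
  <city>City</city>
  <state>State</state>
  <zip>Zip</zip>
</address>
"""

COMPACT = (
    "<address><name>Name</name><addr1>Addr1</addr1><addr2>Addr2</addr2>"
    "<city>City</city><state>State</state><zip>Zip</zip></address>"
)

def address_fields(xml_text):
    """Return a dict mapping each child tag of <address> to its text."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.text for child in root}

# Both layouts yield the same fields
print(address_fields(INDENTED) == address_fields(COMPACT))
```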

Adding a place for country is as easy as defining the <country> tag. Presentation rules do not have to be put into the data itself. There is another language, Extensible Stylesheet Language Transformations (XSLT), that allows you to define processing rules to translate XML into desired output formats.
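The idea of keeping presentation rules outside the data can be sketched in Python as well. This is not XSLT, only a stand-in that uses the standard library, and the layout tables, function name and sample address are all invented for illustration. The data carries a <country> element, and the layout rules live in a separate table that can grow without touching the data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Presentation rules, kept apart from the data (hypothetical layouts)
DEFAULT_LAYOUT = ["{name}", "{addr1}", "{addr2}", "{city}, {state} {zip}", "{country}"]
LAYOUTS = {
    "Spain": ["{name}", "{addr1}", "{addr2}", "{zip} {city}", "{country}"],
}

def render(xml_text):
    """Render an <address> element using the layout for its country."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "") for child in root}
    layout = LAYOUTS.get(fields.get("country", ""), DEFAULT_LAYOUT)
    # Missing elements are treated as empty strings
    values = defaultdict(str, fields)
    lines = [template.format_map(values) for template in layout]
    # Drop lines that ended up empty, such as an absent addr2
    return "\n".join(line for line in lines if line.strip())

# Hypothetical sample address
SPANISH = (
    "<address><name>Ana Example</name><addr1>Calle Falsa 1</addr1>"
    "<city>Madrid</city><zip>28001</zip><country>Spain</country></address>"
)
print(render(SPANISH))
```

In a real system an XSLT stylesheet would play the role of the layout table, and an XSLT processor would play the role of the render function.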

Conclusion

The most important part of this exercise for me was being able to look at existing formats and pick a good solution. Sometimes a new approach or format needs to be developed (PostScript is a good example) but it is always going to be less work if you can start with something that exists.

Because XML is extensible you are not picking something that is a close fit. You are actually selecting an exact fit that allows you to address your future needs. With all the different tools available for XML and XSLT and the number of uses expanding every day, developing your applications around this format will get easier in the future.

Robert Wilson is a Systems Administrator in a company where the boss (who has no idea what Bob does) just says "make it work".

Dean Wilson is (this week) a systems administrator and occasional updater to his pages at www.unixdaemon.net