html2db.xsl

Oliver Steele

Revision 1, 2004-07-30
Revision 1.0.1, 2004-08-01: Editorial changes to the readme.

2004-07-30

Overview

html2db.xsl converts an XHTML source document into a Docbook output document. It provides features for customizing the generation of the output, so that the output can be tuned by annotating the source, rather than hand-editing the output. This makes it useful in a processing pipeline where the source documents are maintained in HTML, although it can be used as a one-time conversion tool too.

This document is an example of html2db.xsl used in conjunction with the Docbook XSL stylesheets. The source file is an XHTML file with some embedded Docbook elements and processing instructions. html2db.xsl compiles it into a Docbook document, which can be used to generate this output file (which includes a Table of Contents), a chunked HTML file, a PDF, or other formats.

Features

XSLT implementation
This tool is designed to be embedded within an XSLT processing pipeline. html2db.xsl can be used in a custom stylesheet or integrated into a larger system. See Overriding the built-in templates.
Customizable
The output can be customized by means of additional markup in the XHTML source. See Customization.
Creates outline structure
h1, h2, etc. are turned into nested section and title elements (as opposed to bridgehead elements).
Accepts a wide variety of XHTML
In particular, html2db.xsl automatically wraps naked item text (text that is not enclosed in a <p>) inside a table cell or list item. Naked text is a common property of XHTML documents, but needs to be clothed to create valid Docbook.

(This feature is limited. See Implicit Blocks.)
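
As a rough illustration of the outline feature, the sketch below shows the shape of output one would expect for a short run of headings. It is not verified output of the stylesheet: the sample text is made up, and the mapping of p to para is an assumption.

```xml
<!-- Input:  <h1>Installation</h1> <p>Unpack the archive.</p>
             <h2>Requirements</h2> <p>A Java runtime.</p> -->
<section>
  <title>Installation</title>
  <para>Unpack the archive.</para>
  <!-- The h2 opens a section nested inside the section for the h1. -->
  <section>
    <title>Requirements</title>
    <para>A Java runtime.</para>
  </section>
</section>
```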

Requirements

html2db.xsl might work with earlier versions of Java and Xalan, and it might work with other XSLT processors such as Saxon and xsltproc.

License

This software is released under the Open Source Artistic License.

Installation

Usage

Use Xalan to process an XHTML source file into a Docbook file:

java org.apache.xalan.xslt.Process -XSL html2db.xsl -IN doc.html > doc.xml

See index.src.html for an example of an input file.
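
index.src.html is not reproduced here. As a minimal sketch of what an input file might look like, the hypothetical doc.html below is well-formed XML and declares the standard XHTML namespace on its root element (that the stylesheet matches this particular namespace URI is an assumption).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- doc.html: a minimal, hypothetical input file -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Sample document</title>
  </head>
  <body>
    <h1>Sample document</h1>
    <!-- Empty elements such as br must be closed, or the XML parser rejects the file. -->
    <p>First line<br />second line, with a <code>code</code> fragment.</p>
  </body>
</html>
```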

If your source files are in HTML, not XHTML, you may find the Tidy tool useful. This is a tool that converts from HTML to XHTML, and can be added to the front of your processing pipeline.

(If you need to process HTML and you don't know or can't figure out from context what a processing pipeline is, html2db.xsl is probably not the right tool for you, and you should look for a local XML or Java guru or for a commercially supported product.)

Specification

XHTML Elements

code/i stands for "an i element immediately within a code element". This notation is from XPath.

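XHTML elements must be in the XHTML namespace, http://www.w3.org/1999/xhtml. (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd is the system identifier of the XHTML 1.0 Transitional DTD, not a namespace name.)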

XHTML | Docbook | Notes
b, i, em, strong | emphasis | The role attribute is the original tag name.
dfn | glossitem, and also primary indexterm |
code/i, tt/i, pre/i | replaceable | In practice, i within monospace content is usually used to mean replaceable text. If you're using it for emphasis, use em instead.
pre, body/code | programlisting |
img | inlinemediaobject/imageobject/imagedata | In an inline context.
img | [informal]figure/mediaobject/imageobject/imagedata | If it has a title attribute or db:title, it's wrapped in a figure. Otherwise it's wrapped in an informalfigure.
table | [informal]table | An XHTML table becomes a Docbook table if it has a summary attribute; informaltable otherwise.
ul | itemizedlist | But see the processing instruction below.

Links

XHTML | Docbook | Notes
<a name="name"> | <anchor id="{$anchor-id-prefix}name"> | An anchor within an hn element is attached to the enclosing section as an id attribute instead.
<a href="#name"> | <link linkend="{$anchor-id-prefix}name"> |
<a href="url"> | <ulink url="url"> |
<a href="mailto:address"> | <email>address</email> |
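
The fragment below puts one link of each kind into a single paragraph. The comment on each line names the Docbook element that the table above leads one to expect, assuming anchor-id-prefix is left empty and that mailto: links are written with href. The ids, URL, and address are placeholders.

```xml
<p xmlns="http://www.w3.org/1999/xhtml">
  <a name="intro"/>                               <!-- anchor id="intro" -->
  <a href="#intro">the introduction</a>           <!-- link linkend="intro" -->
  <a href="http://example.com/">a web page</a>    <!-- ulink url="http://example.com/" -->
  <a href="mailto:user@example.com">mail us</a>   <!-- email: user@example.com -->
</p>
```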

Tables

XHTML table support is minimal. html2db.xsl changes the element names and counts the columns (this is necessary to get table footnotes to span all the columns), but it does not attempt to deal with tables in their full generality.

An XHTML table with a summary attribute generates a table, whose title is the value of that summary. An XHTML table without a summary generates an informaltable.

Any trs that contain ths are pulled to the top of the table, and placed inside a thead. Other trs are placed inside a tbody. This matches the common XHTML table pattern, where the first row is a header row.
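
A small source table shows the pieces these rules act on. The data is made up for illustration; going by the description above, the summary attribute should yield a Docbook table titled "Sample settings", the th row should land in thead, and the two td rows in tbody. Removing the summary attribute should yield an informaltable instead.

```xml
<table xmlns="http://www.w3.org/1999/xhtml" summary="Sample settings">
  <!-- A row containing th cells: moved into thead. -->
  <tr><th>Name</th><th>Value</th></tr>
  <!-- Rows containing only td cells: placed in tbody. -->
  <tr><td>width</td><td>100</td></tr>
  <tr><td>height</td><td>50</td></tr>
</table>
```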

Implicit Blocks

XHTML allows li, dd, and td elements to contain either inline text (for instance, <li>a list item</li>) or block structure (<li><p>a block</p></li>). The corresponding Docbook elements require block structure, such as para.

html2db.xsl provides limited support for wrapping naked text in these positions in para elements. If a list item or table cell directly contains text, all text up to the position of the first element (or all text, if there is no element) is wrapped in para. This handles the simple case of an item that directly contains text, and also the case of an item that contains text followed by blocks such as paragraphs.

Note that this algorithm is easily confused. It doesn't distinguish between block and inline XHTML elements, so it will only wrap the first word in <li>some <b>bold</b> text</li>, leading to badly formatted output. The workaround is to wrap troublesome content in explicit <p> tags.
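
The cases described above can be laid side by side in one source list. This is only an annotated input sketch; the comments restate the behavior described in this section rather than showing verified output.

```xml
<ul xmlns="http://www.w3.org/1999/xhtml">
  <!-- Handled: the item contains only text, so all of it is wrapped in para. -->
  <li>a list item</li>
  <!-- Handled: text followed by a block; the leading text is wrapped in para. -->
  <li>intro text<p>a following paragraph</p></li>
  <!-- Problem case: wrapping stops at the first element (b), so only "some " is wrapped. -->
  <li>some <b>bold</b> text</li>
  <!-- Workaround: an explicit p makes the block structure unambiguous. -->
  <li><p>some <b>bold</b> text</p></li>
</ul>
```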

Docbook Elements

Elements from the Docbook namespace are passed through as is. There are two ways to include a Docbook element in your XHTML source:

Global prefix

A declaration of the fake Docbook namespace may be added to the document root element. Anywhere in the document, the prefix from this namespace declaration may be used to include a Docbook element. This is useful if a document contains many Docbook elements, such as footnote or glossterm, interspersed with XHTML. (In this case it may be more convenient to allow these elements in the XHTML namespace and add a customization layer that translates them to Docbook elements, however. See Customization.)

The fake Docbook namespace is urn:docbook. Docbook doesn't really have a namespace, and if it did, it wouldn't be this one. See The Docbook Namespace for a discussion of this issue.


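<html xmlns="http://www.w3.org/1999/xhtml" xmlns:db="urn:docbook">
  ...
  <p>Some text<db:footnote><db:para>and a footnote</db:para></db:footnote>.</p>
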
Local namespace

A Docbook element may be introduced along with a prefix-less namespace declaration. This is useful for embedding a Docbook document fragment (a hierarchy of elements that all use Docbook tags) within an XHTML document.


    
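
As a sketch of the local-namespace technique (the choice of note as the embedded element and all of the text are assumptions made for illustration, not the original example), a Docbook fragment can redeclare the default namespace on its outermost element:

```xml
<body xmlns="http://www.w3.org/1999/xhtml">
  <p>An ordinary XHTML paragraph.</p>
  <!-- The default namespace switches to urn:docbook for this subtree only. -->
  <note xmlns="urn:docbook">
    <title>Hypothetical note</title>
    <para>Every element in this fragment is a Docbook element.</para>
  </note>
  <p>Back in the XHTML namespace.</p>
</body>
```

Because the declaration is scoped to the note element, no prefix is needed on its descendants, and the XHTML around it is unaffected.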

The source to this document illustrates both of these techniques.

Both these techniques will cause your document to be invalid as XHTML. In order to validate an XHTML document that contains Docbook elements, you will need to create a custom schema. Technically, you then ought to place your document in a different namespace, but this will cause html2db.xsl not to recognize it!

Output Processing Instructions

html2db.xsl adds a few processing instructions to the output file. The Docbook XSL stylesheets ignore these, but if you write a customization layer for Docbook XSL, you can use the information in these processing instructions to customize the HTML output. This can be used, for example, to set the onclick and target attributes of a elements in the HTML files that Docbook XSL creates to the same values they had in the input document.

<?html2db attribute="name" value="value"?>
Placed inside a link element to capture the value of the a target and onclick attributes. name is the name of the attribute (target or onclick), and value is its value, with " and \ replaced by \" and \\, respectively.
<?html2db element="br"?>
Represents the location of an XHTML br element in the source document.

You can also include <?db2html?> processing instructions in the HTML source document, and they will be copied through to the Docbook output file unchanged (as will all other processing instructions).
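
To make the escaping rule concrete, here is a sketch of what the output might contain for a link that carries both attributes, followed by a line break. The exact placement and order of the processing instructions inside the link is an assumption, and the file name and script are placeholders.

```xml
<!-- Input: <a href="help.html" target="_blank" onclick="log(&quot;help&quot;)">Help</a>
            First line<br />second line -->
<para>
  <!-- The onclick value log("help") is stored with each " written as \" -->
  <ulink url="help.html"><?html2db attribute="target" value="_blank"?><?html2db attribute="onclick" value="log(\"help\")"?>Help</ulink>
  First line<?html2db element="br"?>second line
</para>
```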

Customization

XSLT Parameters

<xsl:param name="anchor-id-prefix" select="''"/>
Prefixed to every id generated from <a name=> and <a href="#">. This is useful to avoid collisions between multiple documents that are compiled into the same book. For instance, if a number of XHTML sources are assembled into chapters of a book, you can style each source file with a prefix of docid., where docid is a unique id for each source file.
<xsl:param name="document-root" select="'article'"/>
The default document root. This can be overridden by <?html2db class="name"?> within the document itself, and defaults to article.
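
One way to set these parameters is a small wrapper stylesheet. The sketch below assumes html2db.xsl sits in the same directory; the file name install-chapter.xsl and the prefix install. are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- install-chapter.xsl: hypothetical wrapper stylesheet -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- xsl:import must come before any other top-level element. -->
  <xsl:import href="html2db.xsl"/>
  <!-- Definitions here take precedence over those in the imported stylesheet. -->
  <xsl:param name="anchor-id-prefix" select="'install.'"/>
  <xsl:param name="document-root" select="'chapter'"/>
</xsl:stylesheet>
```

With this wrapper, <a name="setup"> in the source would be expected to produce the id install.setup. Most processors can also set stylesheet parameters from the command line, which avoids the wrapper when only the values change between runs.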

Processing instructions

Use the <?html2db?> processing instruction to customize the transformation of the XHTML source to Docbook:

Processing instruction | Content | Effect
<?html2db class="xxx"?> | body | Sets the output document root to xxx. Useful for translating to preface, appendix, or chapter; the default is $document-root.
<?html2db class="simplelist"?> | ul | Creates a vertical simplelist. Note that the current implementation simply checks for the presence of any html2db processing instruction.
<?html2db rowsep="1"?> | [informal]table | Sets the rowsep attribute on the generated table. Note that the current implementation simply checks for the presence of any html2db processing instruction that begins with rowsep, and assumes the value is 1.
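
The sketch below shows all three instructions in one source body. It assumes each instruction is written as a direct child of the element named in the Content column; the headings and cell text are made up.

```xml
<body xmlns="http://www.w3.org/1999/xhtml">
  <!-- Ask for a chapter instead of the default document root. -->
  <?html2db class="chapter"?>
  <h1>Getting started</h1>
  <ul>
    <!-- Turn this list into a simplelist rather than an itemizedlist. -->
    <?html2db class="simplelist"?>
    <li>first</li>
    <li>second</li>
  </ul>
  <table>
    <!-- Request row separators on the generated table. -->
    <?html2db rowsep="1"?>
    <tr><td>a</td><td>b</td></tr>
  </table>
</body>
```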

Overriding the built-in templates

For cases where the previous techniques don't allow for enough customization, you can override the built-in templates. You will need to know XSLT in order to do this, and you will need to write a new stylesheet that uses the xsl:import element to import html2db.xsl.

The example.xsl stylesheet is an example customization layer. It recognizes the <div class="abstract"> and <p class="note"> classes in the source for this document, and generates the corresponding Docbook elements.
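
example.xsl itself is not reproduced here. A customization layer along those lines might look like the sketch below, which is not the author's file: it assumes the XHTML namespace is http://www.w3.org/1999/xhtml, that the imported templates run in the default mode, and that abstract and note are the intended Docbook targets.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h">
  <!-- Pull in the built-in templates; templates below override them. -->
  <xsl:import href="html2db.xsl"/>

  <!-- <div class="abstract"> becomes a Docbook abstract. -->
  <xsl:template match="h:div[@class='abstract']">
    <abstract>
      <xsl:apply-templates/>
    </abstract>
  </xsl:template>

  <!-- <p class="note"> becomes a note; note needs block content, hence para. -->
  <xsl:template match="h:p[@class='note']">
    <note>
      <para>
        <xsl:apply-templates/>
      </para>
    </note>
  </xsl:template>
</xsl:stylesheet>
```

Templates in the importing stylesheet have higher import precedence than those in html2db.xsl, so they win for the elements they match, and xsl:apply-templates hands the children back to the built-in rules.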

FAQ

Why generate Docbook?

The primary reason to use Docbook as an output format is to take advantage of the Docbook XSL stylesheets. These are a well-designed, well-documented set of XSL stylesheets that provide a variety of publishing features that would be difficult to recreate from scratch for HTML:

Why write in XHTML?

Given that Docbook is so great, why not write in it?

Where there are not legacy concerns, Docbook is probably a better choice for structured or technical documentation.

Where the only legacy concern is the documents themselves, and not the tools and skill sets of documentation contributors, you should consider using an (X)HTML convertor to perform a one-time conversion of your documentation source into Docbook, and then switching development to the result files. You can use this stylesheet to perform this conversion, or evaluate other tools, many of which are probably appropriate for this purpose.

Often there are other legacy concerns: the availability of cheap (including free) and usable HTML editors and editing modes; and the fact that it's easier to teach people XHTML than Docbook. If either of these is an issue in your organization, you may want to maintain documentation sources in XHTML instead of Docbook.

For example, at Laszlo, most developers contribute directly to the documentation. Requiring that developers learn Docbook, or that they wait on the doc team to get content into the docs, would discourage this.

Why not use an existing convertor?

This isn't the first (X)HTML to Docbook convertor. Why not use one of the existing ones?

Each HTML to Docbook convertor that I could find had at least some of the following limitations, some of which stemmed from their intended use as one-time-only convertors for legacy documents:

I got this error. What does it mean?

Q. Fatal Error! The element type "br" must be terminated by the matching end-tag "</br>".
A. Your document is HTML, not XHTML. You need to fix it, or run it through Tidy first.
Q. My output document is empty except for the <?xml version="1.0" encoding="UTF-8"?> line.
A. The document is missing a namespace declaration. See the example for an example.
Q. Some of the headers and document sections are repeated multiple times.
A. The document has out-of-sequence headers, such as h1 followed by h3 (instead of h2). This won't work.
Q. Fatal Error! The prefix "db" for element "db:footnote" is not bound.
A. You haven't declared the db namespace prefix. See the example for an example.

Implementation Notes

Bugs

Limitations

Wishlist

Design Notes

The Docbook Namespace

html2db.xsl accepts elements in the "Docbook namespace" in XHTML source. This namespace is urn:docbook.

This isn't technically correct. Docbook doesn't really have a namespace, and if it did, it wouldn't be this one. RFC 3151 suggests urn:publicid:-:OASIS:DTD+DocBook+XML+V4.1.2:EN as the Docbook namespace.

There are two problems with the RFC 3151 namespace. First, it's long and hard to remember. Second, it's limited to Docbook v4.1.2, but html2db.xsl works with other versions of Docbook too, which would presumably have other namespaces. I think it's more useful to underspecify the Docbook version in the spec for this tool. Docbook itself underspecifies the version completely, by avoiding a namespace at all, but when mixing Docbook and XHTML elements I find it useful to be more specific than that.

History

The original version of html2db.xsl was written by Oliver Steele, as part of the Laszlo Systems, Inc. documentation effort. We had a set of custom stylesheets that formatted and added linking information to programming-language elements such as classname and tagname, and added a table of contents to chapter documentation and numbers to examples.

As the documentation set grew, the doc team (John Sundman) requested features such as inter-chapter navigation, callouts, and index and glossary elements. I was able to beat all of these back except for navigation, which seemed critical. After a few days trying to implement this, I decided it would be simpler to convert the subset of XHTML that we used into a subset of Docbook, and use the latter to add navigation. (Once this was done, the other features came for free.)

During my August 2004 "sabbatical", I factored the general html2db code out from the Laszlo-specific code, refactored and otherwise cleaned it up, and wrote this documentation.

Credits

html2db.xsl was written by Oliver Steele, as part of the Laszlo Systems, Inc. documentation effort.