Software Technology Magazine Quarterly: Columns and Articles
























   


Data Capture and Conversion: XML's 'Last Mile'

Like the high-bandwidth optical fiber, satellite, and wireless fat pipes of the Internet striving to reach across the proverbial 'last mile' to American homes, the XML Express is about to run smack into the worlds of data definitions, directory services, information engineering, data management, and the art of data conversion. Despite all the vendors touting 'one size fits all' solutions, only intensive pick-and-shovel work, backed by a thorough understanding of sound software engineering and information architecture principles, can lead to success. The resulting competitive advantage, reduced operating costs, and constantly renewed customer appreciation make it well worthwhile.


by Max Chapman

As civilization moves further into an e-business future, open systems facilities such as XML will assume a greater role in preparing not just current, but archived, data sources for general and global access and distribution. During FY 2000, the National Archives completed a study and selection of technologies to prepare archived Federal Government records for the future. The National Archives eschewed such technologies as Oracle, DB2, SQL Server, word processing formats, various Unix- and Microsoft-based solutions, text patterns on hard disk or tape, and others in favor of XML. XML was cited as the most open computer-accessible format, because XML-marked data could exist equally on paper (itself in OCR-able form) or on any computer media. XML was therefore seen by the Archives as the one format most likely to last for centuries in recoverable and readable form.

In addition to Federal Government departments, many State, Local, and educational organizations are contemplating similar projects. So it seems that the promise of XML will always cross paths with the decades-old challenges of managing data and maintaining a clean, accurate, well-culled database: the realm of DBAs and information managers. But XML gurus are discovering that one can't send XML data and messages either around the world or from server to server unless the XML schemas and the XML-formatted data are present and available, in a properly interpretable context.

One soldier of distinction in the war to convert America's largest repositories of information, data, and publications is Mark Gross, President of Data Conversion Laboratory (DCLab) of Fresh Meadows, N.Y. Dispatched to do battle in adapting some of the world's most massive archives of legacy data to the brave new world of XML (including the barely-conceivable 'data galaxies' of General Motors, the National Institutes of Health's National Library of Medicine, and the United States Library of Congress), this gladiator-general has done the XML data conversion equivalent of conquering Gaul and returning to Rome with head bloodied, but unbowed.

Over 100 million pages of data have been converted by DCLab in the twenty years that Mark Gross has led data conversion efforts. DCLab is a founding member of the XML/SGML users group and has performed extensive and advanced data conversion work for General Motors, Lucent Technologies, Rolls Royce, John Deere, Xerox, Gulfstream, and many others. Advanced medical and E-commerce customers currently include Web MD, Ovid, Earthweb, HighWire, Classwell, MDConsult, and others.

For the Federal Government, DCLab has honorably and successfully completed contracts for the Defense Department, all three armed services, and several major defense contractors including Lockheed and Boeing, as well as various Federal Agencies and Departments such as NIH, the FAA, the USDA, and others. DCLab work has been held in high regard by the major technical and professional societies including the IEEE, Society of Petroleum Engineers, The Water Environment Federation, and the American Society for Metals, to name a few.

Some of the world's largest libraries are having Gross and DCLab protect their archives for the future. These include the Library of Congress, the New York Public Library, National Library of Medicine (NIH), the University of Pennsylvania, and Brigham Young University (heritage) libraries, to name just a few. In a related NBC Nightly News broadcast (April 30, 2001), a student at the University of Virginia expressed her gratitude that she was able to access the University's Special Collections over the Web for materials that were previously barred from public view owing to their rarity and physical fragility, thus validating a new Library Sciences technology and empowering advanced research capabilities made possible largely via XML. This is just one example of the new cost-effective capabilities afforded libraries converting archives to XML, and to the Publications Industry's presentation-oriented XML predecessor and superset, SGML. DocBook, a document type for writing structured documents in either SGML or XML, is also popular. World-class publishers, including McGraw Hill, John Wiley, the British Medical Association, Mosby, Inc., Lippincott Williams, and Blackwell Sciences, rely on Gross and DCLab for their access to multimedia and electronic mass outlets.

Author of the Data Conversion chapter in "The XML Handbook," a fixture in most data analysts' and information engineers' collections, Gross has struggled with major issues related to data conversion and XML, as well as W3C standards and the physical and cost-containment challenge of the data conversion 'manufacturing process'. Gross says DCLab has a representative on the OEB (Open eBook) subcommittee (Publications Structure Working Group, working with NIST) for XML e-book standards. "We're practitioners involved in the evolution of XML. However, a lot of time stuff from panels isn't implementable" in the real world, he says.

Gross believes software engineers need to constantly adapt to new technology. "The history of our company is about it. We started off twenty years ago, well before XML and SGML existed. When I chart it out, there were probably five or six points in time at which we re-invented the company ... First there were word processors, then electronic publishing, then came CD-ROMs and all, then came SGML — XML and HTML are fairly recent. So there's been a constant need to adapt as new technologies arise, even in all of the current architectures — which we do have to change all at once. We have to adapt to the marketplace and what is needed. So what we have to produce has changed dramatically."

In working with his clients to develop a grand architecture for large XML repositories, Gross focused on where technology could be in future years, and how information from these archives could be disseminated. The traditional image-storing methods that once led to the establishment of AIIM and were used by forms-processing-oriented enterprises are now seen as totally inadequate, owing to the need to disseminate information in many forms for many purposes. Most major corporations now see that the recurring and inevitable need for "formatting yesterday's information to fit today's uses" has become a major labor sink-hole that needs to be tackled.

When asked whether DCLab has considered writing COTS software to make these technologies available on the open market, Gross responded that it has always been considered at least once a year when DCLab does its strategic plans. However, the mass-market is "not really the market we play in. There is COTS software available and already on the market, but it's mostly for simpler kinds of things. The reality of much of the more complex materials we work with is that it really needs to be tailored ... we're not really writing the software for our customers. We have a software suite that we're using. And we're basically tailoring it as we work with each customer — we're engineering the front end."

"Now, where customers are going to have ongoing needs for the same kinds of materials, once we've tailored it and run it through thirty, forty, fifty thousand pages, then it makes sense to license them to configure it on their systems. And we have customers that have done that. [However,] most of the time it doesn't pay for them to have their staff perform XML data conversion and to constantly monitor the materials coming out of production, because they have their regular business to do. It's not day-by-day, every day we're going to do eight thousand pages. There are peaks and valleys. And so building the clients' XML conversion expertise internally doesn't usually make sense."

"Over half our business today is ongoing needs where we've worked with customers for years, and will continue hopefully over a future stream of years. But the materials change ever so slightly all the time. Authors work in different ways, the XML market out there changes, they have different needs, so I think they're finding it just doesn't make sense to build their own internal staff, usually."


CHART

With Permission, Source Cited: A chart similar to "How do the Different Technologies Stack Up?" from March 2001 FOSE presentation.


A few of the forms of data and information dissemination methods currently in vogue (e.g., TIFF, PDF, HTML, SGML, and XML) are shown in comparison [citing chart above] to an average of common elements of "native" formats (i.e., word processing formats in common use today). Most paper-based forms and publications, computer data on various media, pictures and drawings, microfiche, etc. fall into one of the chart's categories for conversion to XML. While XML and SGML are closely related, the publishing industry and libraries often use SGML since they started their efforts earlier, before XML was well established. Organizations starting now will often use XML since it is somewhat easier to get started with. While the initial costs of SGML are somewhat higher, the total costs over a project lifecycle for a large project are fairly comparable. The startup costs for archiving in XML are a bit lower, Gross says, primarily because for XML, "it's a little easier to implement the tools." The real boost in productivity of XML tools has come only in the last year, he says. However, even today the more sophisticated, time-tested tools exist primarily in the world of SGML. In many cases, these tools can be set to work productively with XML.

Gross says his company has built a suite of incredibly productive tools for throughput management and quality excellence — but these are designed for high-throughput organizations and are intended only for internal use. XML has, for example, the ability to define and enforce certain data elements which can be used exclusively for e-commerce and the Web. For editing work, however, Gross views with favor the classic Arbortext suites, Frame+SGML (which will work with XML when certain options are set), and FieldCheck.


SIDEBAR

Gross sees current major data content use issues to be:

       Representation of Distributed Page Images: The ability to produce and distribute an exact appearance or reproduction of an original source page, replete with original fonts, composition, and page integrity

       Repurposing: The ability to create, from old and ongoing sources, new versions of data suitable for derivative uses (e.g., the Web, diagnostic equipment, wireless and hand-held devices, voice-operated systems, etc.)

       Searching: The ability to locate and retrieve information via text searches, specifically including advanced searches relying on information context and "understanding"

       Component Re-Use: The capability to re-use portions or 'blocks' of data and information for different end products, purposes, and documentation sets

       Enforcement of Data Standards: The ability to ensure, and provide assurance, that the information produced and disseminated is done so consistently and meets enterprise standards.

       Interoperability and Interchange with Vendors, Customers, and the World: The ability of internal and external sources to use converted information for communication with external recipients and, further, to use this information as inclusions in content belonging to other enterprises or organizations (e.g., via syndication).


"The Library of Congress is in many ways an unusual project. Our area is mostly textual- and publishing-related. And what we do for the Library of Congress is SGML/XML archiving of old materials mainly from paper." Gross says none of the LOC data being converted is from old computer files. "It's paper, and some of it originates from the 1700's. It's scanned with special low-light scanners back at the Library." Even to date, the work completed has not gotten up to any publications or information that has been stored on computer media (if, in fact, the LOC even has any available — most of the Federal information retained on computerized media goes either to the National Archives or to the repositories in the various Departments and agencies, e.g., NIH, the IRS and Treasury, the DoJ, the DoD, and others).

Gross confirmed the huge conversion effort was not just a technical challenge, but a people challenge. Obtaining talented personnel, developing systems and procedures, and designing and implementing training programs were key. "There's a lot of that. Especially with the libraries, e.g., the Library of Congress, there's a very wide range of materials. The challenge ... there's one challenge of scanning, and there's the other challenge of the OCR getting good results. The text, and the third challenge, is actually the markup — it's actually SGML. Considering the technology, it could have been XML — it'd be pretty much the same, except that large organizations that start with SGML usually continue in SGML in order to retain consistency. If they were starting the project today, it would be XML — with stylesheets and appropriate formatting considerations. It's functionally the same. Since the LOC started in SGML, for consistency, this output format is continued. There's no point in re-converting it all exclusively to XML. But the SGML that is produced is all XML-compliant. We do this as a regular matter of course," says Gross.

"The third challenge is the 'Markup'. Getting the production software to consistently tag a wide range of materials is very difficult. At DCL, we have a whole process to define specifications, identify all the different things that might come up in a document, and get the customer's input on how they should be tagged. The LOC requirements document is drawn up by DCLab upon study of the client's materials, equipment, and facilities. When the Government representative or client signs off on this formal document based on the sample work and corresponding to the specifications, the customer can see the quality output they're getting and can then load it into their test systems. And — that works! It worked even for the Library of Congress, where the range of materials could be anything."

Gross agreed that Federal repositories of data such as those of AFIP and the Library of Congress stretching back into the 1700's were likely to have issues of data definitions, structures, storage methods and media changing over time and posing challenges for XML and SGML encoding. Even acknowledged content experts and professionals within the client may initially have differing views on how data exceptions should be handled. "Federal data has changed, so how you define it, what fields you need and how they work together and what's important has changed over time, and is guaranteed to change over time. Medicine will continue to improve, data points will change, all that will change."

Gross sees the benefit of XML and its derivatives to be more closely tied to solving the Tower of Babel problems that the plethora of hardware and software vendors cause. "I don't think XML is intended to be a solution to any of the data definition challenges. I think it is intended to be a solution to something else. I think it is intended to be a solution to define what the data will be, in a unified, standardized way, so that you don't have to deal with the intricacies of this DBMS or microcomputer vs. that. This computer's using Oracle or DBASE, that computer's using some SQL Server or other, etc. XML allows you to define what your data is; the interrelationships between the data, etc., in a way that everyone agrees is a standard way of doing business. And then your application (SQL Server or whatever) should be able to read a data layout and make whatever changes it needs to load the data. XML is not currently a native form that you actually can use in a computer [i.e., generateable or usable by applications such as word processors or email — Ed]. But you've loaded something," says Gross.

"This is a way that you can define your database so that everybody knows how to read it using standard, published standards. Not only that, but people know how to produce it. You can require standard XML of whoever is the vendor. Or the typesetter. Or whatever the data, 'I want my data back as XML files.' And you can define what your XML files are. It's this DTD, or that Style Sheet, these are the allowable tags, and all of them are allowed, and you say, 'This is the way I want them back.' It's a very definable quality. So it provides a solution. That doesn't mean it's not going to change over time. But as it changes over time, you can very easily refine your mappings to a new data base."

Gross observed that cross-fertilization of emerging technologies among the Federal sector, the Department of Defense, and commercial enterprise led to the development of XML. "I think there's a lot of that. And even a lot of the SGML (later, XML) technologies developed out of the Government with the military and the DoD. Actually, it didn't move there for a while, but what got developed there started getting used in industry because there were good business reasons for it to work. It was necessary. So a lot of our work over the last five years has really been in industry and in organizations, etc. And now the DoD is making some major inroads to move their materials into electronic form and a lot of stuff that got developed in industry is being applied back."

Rather than industry standards per se, hands-on work by the practitioners of XML data conversion and encoding like DCLab was seen to be the phenomenon advancing this key technology. "Constant cross-fertilization occurs, since all of this is public. We're doing work for the Library of Congress and also doing work for the National Library of Medicine at NIH. There's other work being done for other universities — we've worked for Brigham Young University, which, while not exactly the same thing, is related kind of work. I suspect development that happened there will now cross-fertilize to other libraries. Other libraries include N.Y. Public Library, the NYU Law Library, and Harvard University. The whole library area I think is going to explode over the next few years because there's so much material that's inaccessible today. Most of the work we're doing today is with materials that are not easily available: Special Collections that exist in only one place." Hence, the favorable reaction of users and researchers, such as the student from the University of Virginia.

These Special Collections can exist at the LOC "or the library of Brigham Young University, or Harvard University's library. They've got collections of materials that are unique that normally nobody can get to, because they're all rare collections, one of a kind. By making them available over the Web, and putting the XML tagging in the right place, it allows them to be accessed very widely," says Gross.

A key goal of data conversion to XML is to encode or mark up data only once, then produce many products from this markup instantaneously and inexpensively. Other major goals include facilitating semantically complex searches from XML-encoded data and re-using the XML-formatted data many times in many ways. XML encoding promotes the free interchange of data, enabling and improving man-machine communication by reducing technical content and allowing functional or industry communities to agree on data vocabularies and structural organization, while preparing data to remain useful for a long time.

Yet, underneath it all, raw XML files are just plain text (in Unicode or ASCII), fine for text editing, database computing, word processing, keyed searching, etc., or even as the native file format of any software system or package, but, without any supporting presentation language, totally unsuitable for any form of higher-level presentation or display. Hence, the addition of XSLT capabilities and stylesheets, both internal and external to the XML file. XML-tagged data is used for information interchange by data aggregators for scientific and journal websites, the semiconductor industry, various medical researchers, etc. Even in the pure text form, XML-tagged data can be used by the enterprise throughout the life-cycle of a product (and exchanged among stove-pipe divisions), in direct machine-to-machine control and data transfer, between proprietary formats in software and servers, in inter-process communication (IPC and RPCs), and in transactions and reporting among business partners (B2B, B2C, G2B, G2C, etc.).

The use of XML tags and schemas to make contextual and multi-stage-search decisions in an indexed search is evident. For example, a search for all pathologists' findings based on white patches of skin can now be made to skip over pathology reports signed by Dr. White, patients resident at the Our-Lady-in-White Hospital, contributing reports from labs located along White Avenue, and a reference by a nurse to a white elephant piece of equipment. XML text tagging can spare researchers (or anyone) from having to retrieve masses of results they don't want, from not getting the results they do want, from not knowing if they got all the results they sought, and from, most dreaded of all, the manual sub-search. Extensive manual culling could become a thing of the past, as Alta Vista users might be happy to hear!
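The 'white patches' scenario can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a description of any production search engine: the report layout and every tag name in it (report, signedBy, hospital, labAddress, finding, nurseNote) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical pathology report; every tag name here is invented for illustration.
REPORT = """
<report>
  <signedBy>Dr. White</signedBy>
  <hospital>Our-Lady-in-White Hospital</hospital>
  <labAddress>12 White Avenue</labAddress>
  <finding>White patches of skin on the left forearm</finding>
  <nurseNote>The old scanner is a white elephant</nurseNote>
</report>
"""

def search_in_context(xml_text, tag, term):
    """Return the text of every <tag> element whose content mentions term."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    return [
        el.text.strip()
        for el in root.iter(tag)
        if el.text and term in el.text.lower()
    ]

def search_anywhere(xml_text, term):
    """Plain keyword search over all text, ignoring the markup."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    return [t.strip() for t in root.itertext() if term in t.lower()]

context_hits = search_in_context(REPORT, "finding", "white")
keyword_hits = search_anywhere(REPORT, "white")
print(context_hits)       # only the pathologist's finding
print(len(keyword_hits))  # every element that happens to mention "white"
```

The keyword search matches the doctor, the hospital, the street, and the nurse's aside; the tag-scoped search returns only the finding.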

The organization of XML-formatted data includes DTDs and styles for document presentation and use (printed and displayed output, etc.) while enabling the exchange of style sheet information. Since no general or overarching standards for formatting have become widely accepted, the options currently include (in estimated degree of usage) the W3C's early offerings, XSL (Extensible Stylesheet Language) and XSL-FO (Formatting Objects); the United-Nations-backed ebXML; DSSSL (Document Style Semantics and Specification Language); Cascading Style Sheets (CSS), a W3C recommendation; AXE; Commerce One's NAS; and proprietary stylesheet methods by Oracle and others. Those standards using APIs, objects, and other active code calls (e.g., SOAP) rely primarily on the Document Object Model (DOM).

The two key success criteria for a stylesheet regimen are: one stylesheet, many documents (maintaining consistency of format, "look and feel" across many documents) and one document, many stylesheets (feeding to different media types and generating different derivative documents — e.g., selections, abstracts, summaries, indexes, catalogs, and others).

While conventional wisdom is that, for XML to be universally useful, all participants must use the same tags, real world interoperability requires some transformation of "my tags into your tags," if only to expand the business use of XML. It is possible and reasonable to alias elements and content when the elements are semantically the same but use different names (PO vs. Purchase Order, etc.), when the elements as applied are semantically or functionally close enough (postal code for zip code), and when one set of elements is a recombination or subset of another.
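Aliasing "my tags into your tags" can be as simple as a lookup table applied during a tree walk. The sketch below assumes a hypothetical alias table and a made-up partner document; real vocabularies would also need rules for recombined or subset elements, which are not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical alias table: partner's tag name -> our tag name.
ALIASES = {
    "PO": "PurchaseOrder",
    "ZipCode": "PostalCode",
}

def apply_aliases(xml_text, aliases):
    """Rename aliased elements and return the re-serialized XML."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Tags without an alias are left unchanged.
        el.tag = aliases.get(el.tag, el.tag)
    return ET.tostring(root, encoding="unicode")

partner_doc = "<PO><Item>Gasket</Item><ZipCode>11365</ZipCode></PO>"
translated = apply_aliases(partner_doc, ALIASES)
print(translated)
```

Content and structure pass through untouched; only the element names change.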

Perhaps the most easily-available stylesheet method for preparing XML data for presentation, XSL, uses the XSL Transformation (XSLT) style-sheet processing facility for translating (while retaining schema to the degree possible) a given set of XML tags into HTML versions for browsers; other XML tag sets for messaging or further processing; straight text (Unicode or ASCII) for insertion into database records; and, the non-XML tag sets of the alternate stylesheet systems mentioned previously (SGML carries its own styling notation).
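XSLT is a language of its own, and the Python standard library does not include an XSLT processor. As a rough stand-in, the sketch below shows the underlying idea in plain Python: one tagged source feeding two outputs, HTML for a browser and straight text for a database field. The sample document and its tag names (article, title, para) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document; tag names are illustrative only.
ARTICLE = "<article><title>Last Mile</title><para>XML &amp; conversion.</para></article>"

def to_html(root):
    """Map <title> to <h1> and each <para> to <p>, in the spirit of an XSLT template."""
    parts = ["<h1>%s</h1>" % escape(root.findtext("title", ""))]
    parts += ["<p>%s</p>" % escape(p.text or "") for p in root.iter("para")]
    return "\n".join(parts)

def to_text(root):
    """Flatten the same source to plain text, e.g. for loading into a database field."""
    lines = [root.findtext("title", "")]
    lines += [p.text or "" for p in root.iter("para")]
    return "\n".join(lines)

root = ET.fromstring(ARTICLE)
html_out = to_html(root)
text_out = to_text(root)
print(html_out)
print(text_out)
```

A real deployment would express these mappings as XSLT templates and run them through an XSLT processor; the functions here only mimic that division of labor.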

Likewise, DTDs can be customized and applied for different purposes: Authoring to build a document or database (very strict rules and enforcement), Interchange to exchange data (very loose rules in order to be enabling), Conversion (so loose as just to be a data classification tagset, primarily applied to describe legacy collections), and Output to prepare the data for information re-arrangement or subset creation. For print and Web publishing, XML provides one manageable source capable of feeding to many different outputs and media device types (Web, CD-ROM, handheld PDA, email and wireless units, voice synthesis, etc., as well as all forms of print and display). In computational processing, XML provides adaptability for different hardware, software, operating systems, etc., for input, manipulation, and edit or display. But most critically, Publish on Demand (POD) with customized output is enabled for output at publishers' plants, syndicated Web sites, and publication stations anywhere in the world.

The XML 'data centric' integration model for integrating services on the Web allows robust and open, 'loose' coupling of applications. Using the XML data comm model, the Web becomes friendly to other media and provides for data and information syndication, on-demand publication, independent (and thus more true to actual appearance! Ask anyone who prints web graphics directly from a browser) print styling from Web display and presentation systems, and dynamic data transfers into and out of Web-page-supporting databases. XML's data-centric integration model can additionally source dynamic Web page systems directly from existing enterprise servers or, via messaging, even certain legacy systems set up for near-real-time disk-based data transfer.

Extending this model to the enterprise, an XML data 'source' can, under direction of one or more servers, stream formatted data output to many displays, printers and devices, with self-pre-formatting, and arbitrate, dialog, and control the flow of data among widely-scattered servers. Selections from a database of XSL files step up to bat for each destination and application in the requested presentation, processing, and storage layers. Under this model, XML-represented data can be, and has been, stored with complete indifference in object-oriented, relational, and hierarchical databases.

Publishing

Content management using XML repositories is a phenomenal new and cost-effective practice enabling data management at many levels of granularity. Further, it fosters the ability to combine data from many sources on the fly using increased searching precision and facilitates the re-use and repurposing of the data (via electronic "slice-and-dice" partitioning on a hierarchical basis). Content feeds and customization can be controlled from a central repository gated by an enterprise portal via resident business logic servers. Much of the content data can be stored in an ordinary relational database in untagged format and extracted using automated or automation-supported SQL queries and reports while wrapping descriptive or destination-process tags around the data (using a metadata-based generator) as it is fed to the destination or transmission medium. Alternatively, data can be stored in the database in pre-tagged XML form, although this approach limits additional granularity and future schema change. For on-demand publishing, XML/XSLT transformations and boilerplate additions can prepare content for print publication via Adobe PDF, PostScript, HTML, software and hardware typesetters, TIFF image, SGML print engines, proprietary formats such as HPGL, RTF (Microsoft Word's 'open format') and .MIF (FrameMaker) outputs, Quark files, etc.
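The 'untagged storage, tagged on the way out' approach might look like the following sketch, which uses an in-memory SQLite database as a stand-in for the enterprise DBMS. The table, its columns, and the column-to-tag map are all hypothetical; the map plays the role of the metadata that drives the generator.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata: column name -> XML tag to wrap around its value.
TAG_MAP = {"title": "Title", "author": "Author", "year": "PubYear"}

def rows_to_xml(conn, table, tag_map, record_tag="Record", root_tag="Records"):
    """Run a SELECT over the mapped columns and wrap each value in its tag."""
    columns = list(tag_map)
    # Table and column names come from trusted metadata here, not user input.
    query = "SELECT %s FROM %s ORDER BY rowid" % (", ".join(columns), table)
    root = ET.Element(root_tag)
    for row in conn.execute(query):
        record = ET.SubElement(root, record_tag)
        for column, value in zip(columns, row):
            ET.SubElement(record, tag_map[column]).text = str(value)
    return ET.tostring(root, encoding="unicode")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, author TEXT, year INTEGER)")
conn.execute("INSERT INTO articles VALUES ('Last Mile', 'Chapman', 2001)")
xml_out = rows_to_xml(conn, "articles", TAG_MAP)
print(xml_out)
```

Because the tags live in the map rather than in the stored rows, a schema change means editing the map, not re-tagging the data.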

Context exclusion within content domains is a key value to the retrieval capability of XML repositories. Redaction or selection can be performed according to tag identification of background, references, experimental design notation, rejected options, methodology, acknowledgements, introduction, history, examples, geographic names, etc. — or any portion separately identifiable by concept. Consequently, to enable context exclusion in XML extraction and processing, good XML markup should identify unimportant information (according to expected users or usage), secondary or supporting information, negated information, or large stretches of detail that can be ignored.
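Context exclusion by tag can be sketched as a filter over a document's sections. The sample paper and its section tags below are invented for the example; a real archive would take the excluded set from its own DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical journal article; section tag names are illustrative only.
PAPER = (
    "<paper>"
    "<introduction>Background on the study.</introduction>"
    "<methodology>Double-blind trial.</methodology>"
    "<results>Treatment reduced symptoms.</results>"
    "<acknowledgements>Thanks to the lab staff.</acknowledgements>"
    "</paper>"
)

def extract_excluding(xml_text, excluded_tags):
    """Return (tag, text) for each top-level section whose tag is not excluded."""
    root = ET.fromstring(xml_text)
    return [
        (child.tag, "".join(child.itertext()).strip())
        for child in root
        if child.tag not in excluded_tags
    ]

kept = extract_excluding(PAPER, {"introduction", "acknowledgements"})
print(kept)
```

The same filter, pointed at a different set of tags, serves a different audience without touching the source.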

Markup to support navigation to specific content areas involves tagging metadata, linkages, cross-references, bibliographic information, digital object identifiers (pictures, sound representations), maps, indices, etc. Metadata which is important to incorporate in XML implementations for large or serious archives or data stores includes application metadata (UML, models, specifications, requirements), content access vocabularies and subject terms, properties of user interfaces, security (levels, access, authority), version control (including edition tracking and permanent configuration management), and bibliographic data for publishing.

The normal practice is to populate the archive repository in XML and the bulk of applicable metadata in a separate database. However, a good deal of metadata is built into the XML elements (e.g., bibliographic headers of journal articles, index terms and keywords both prior and embedded in content) and into XML attributes (e.g., the who-what-why-and-wherefores associated with the schema structure, security-level indicators, and information class or content type reflected in that structure).

Archiving for Publication Support

Publication, with its complex conceptual structures, indexing and cross-reference needs, and multivariate output requirements, is arguably the most complex challenge an XML systems architect can face. XML archives and data stores must be designed to "snap in" to both complex data content logical structures and new output formats as well.

Despite the superiority of XML/SGML for archival and central storage purposes, XML content without formatted text-and-image output such as PDF or TIFF is not an option for most publishers. Similarly, most libraries and publishers have Web sites linked to at least part of their content. Combining the strengths and offsetting the weaknesses of the various publication modes is a key part of a good XML system architecture.

While proprietary, the Adobe PDF format offers tight interchange and precise and faithful reproduction and display of documents and pages. It is easily constructible from postscript files, SGML with TIFF imaging, or XML/XSL sources, and can be used for electronic communications (esp. CD-ROM and Internet) document delivery, pre-press work, online repositories, Web-sourced white papers, and long-term archiving (solely of the printable aspect of the file). In static archive comparison, a PDF document is all in one piece, all printable text is included, and the "look and feel" is as the author intended (assuming good paper conversion). XML documents may consist of many parts or part of many systems (e.g., including stylesheet files and relevant database-resident metadata). Some text may be generated and therefore not searchable, and, depending on stylesheet support, original formatting may be anywhere from 'as the author intended' to lost forever.

It makes sense to distribute the PDF format when all (generated) text must legally be present; when page layout, design, or location on the page is critical; when pages are more important than information and reading is the primary objective (as opposed to search, extraction, inclusion, or re-use); when the source has no repeatable content or capturable structure (as in many Quark pages); and when read-only archives, legal documents, or fast proof copies (galley equivalent) are required.

XML content is distributed in preference to PDF files when more than a standard, page-oriented print or display is necessary, when security issues and search indexing require complex processing (including digital signatures and document extracts created on-the-fly based on a user's clearance), when text generation or further processing, selection, or re-use or recombination is necessary, when fine granularity is required, and where the components of information are more important than the pages.

While the PDF output format will continue to be a popular pre-press and archival form, advanced XML publishing systems producing PDF and HTML (or its successors) as options will gain in share and ultimately 'sweep the field' of competitors. While searchable PDF (limited search capability) may make short-term gains over the more complex omnibus XML implementations requiring more technical skills, this trend will reverse in the long term as the six overarching capabilities of XML are recognized and demanded and the limitations of the PDF format are recognized. In the meantime, it will take a format of pretty attractive pages and snazzy performance to wean publishers completely off PDF — a proven format that is 'easy but expensive' (at least, when publishers are creating content in licensed Adobe e-format!).

eBooks are not a competitor to XML or an output option, nor are they, as commonly thought, a different set of hardware platform standards for mobile display; rather, they are applications optimized for easy viewing and reading. By default, HTML is the most common format for eBooks in actual use today. As applied to eBooks, HTML is familiar, easy to create from many authoring and typesetting data formats, readable on most eBook platforms, offers easy basic formatting (though very poor font and picture placement precision), and allows hyperlinking, a major advantage for eBook readers. However, variations in display unit behavior, like those among the proprietary browsers, can cause problems, and useful search capabilities are severely limited.

PDF-format eBooks are popular with electronic libraries and some publishing houses. eBooks in this format are familiar to most Web users, are viewable identically on most platforms, preserve the author's original professional layout and appearance, allow limited linking, and are easily created via automated means from any format that can be printed. Disadvantages are that non-trivial search capabilities are limited, re-use of text and content by viewers is limited and awkward at best, and PDF pages which are designed for print are often very difficult to read on small, monocolor, or low-resolution screens.

eBook content sources formatted in XML are currently limited, but rapidly growing. They are backed by the Open eBook Specification (Basic for HTML, and Extended for full XML, but no standards for security), which has as its Presentation Layer formatting standard Cascading Style Sheets (CSS), a W3C Recommendation. High quality search and retrieval is possible, re-use and re-purposing is easy (a must for professionals, writers, avid human-networking readers, and literary critics), and there are rich linking and hyperlinking capabilities. Disadvantages are that it may foster dependence on Microsoft, XML eBook formats are unfamiliar to most publishers, and additional investment in skilled tagging labor is required.

XML Publication Issues

Some time ago, and even today, the appropriate form of tables for inclusion into the XML standard was at issue. Most publishers, archivists, and XML professionals lean towards the CALS table model, which was developed by the Federal Government and DoD for use within SGML. "CALS spawned a series of SGML DTDs which are still certainly used, but one of the most important contributions was the CALS table model. It developed into a standard way of representing tables. It was effective in most implementations and started getting used in lots of other places. So the standard CALS table model is what a lot of people are actually using [for XML table encoding] outside of the CALS community," says Gross.

The full (SGML) CALS table model is the most powerful, complex, and comprehensive table model available. For example, it has a rich, and widely varied, vocabulary and definition-description set for the cell-characterization parameter ("%tbl.entry.mdl"), including such fancy niceties as FootnoteRef, Xref, Citation, CitRefEntry, ForeignPhrase, Trademark, CalloutList, ItemizedList, OrderedList, GUIMenu, MediaLabel, AuthorInitials, RevHistory, FuncSynopsis, et al (these from the Orion DocBook). Other and more conventional parameters present in the "pure XML extraction" of the CALS table model include tbl.table.name, tgroup, colspec, thead, tbody, tfoot, row, entry, rowsep, pgwide, %bodyatt, %tbl.table.att, align, valign, colnum, colname, colwidth, colsep, etc., plus a few other perfectly intuitive variables. Limited nesting of tables within tables is possible.

However, many XML users favor the newer HTML 4.0 table model, because of its sheer simplicity in rendition. The HTML table model is capable of reproducing 90% of the table structures that the CALS table model can generate. Elements of an HTML 4.0 table (the so-called Complex Table Model; 3.0 is considered the Simple Table Model) include TABLE, COLGROUP, COL, CAPTION, THEAD, TBODY, TFOOT, TR, TH, and TD, along with the FRAME and RULES attributes of TABLE. Table nesting is unlimited, but rendered results are unpredictable beyond five levels in most browsers. Lastly, user-defined table models are used for special cases or to adapt to proprietary applications (e.g., Oracle data tables), but are severely limited in that they are not recognized by generic XML applications or most output rendering software.
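
The relationship between the two table models can be sketched in a few lines of Python. The fragment below maps the common core of a CALS table (tgroup, thead, tbody, row, entry) onto the HTML equivalents (THEAD, TBODY, TR, TH, TD). It is a deliberately minimal illustration under stated assumptions: the sample table is invented, and the sketch ignores colspec widths, spans, tfoot, and every other CALS feature that makes real conversions hard.

```python
import xml.etree.ElementTree as ET

# Invented two-column CALS table using only the core structural elements.
CALS = """\
<table>
  <tgroup cols="2">
    <colspec colname="c1"/>
    <colspec colname="c2"/>
    <thead><row><entry>Format</entry><entry>Strength</entry></row></thead>
    <tbody>
      <row><entry>PDF</entry><entry>Fixed layout</entry></row>
      <row><entry>XML</entry><entry>Re-use</entry></row>
    </tbody>
  </tgroup>
</table>
"""

def cals_to_html(cals_text):
    """Map CALS thead/tbody/row/entry onto HTML thead/tbody/tr/th|td."""
    cals = ET.fromstring(cals_text)
    html = ET.Element("table")
    tgroup = cals.find("tgroup")
    for section, cell_tag in (("thead", "th"), ("tbody", "td")):
        src = tgroup.find(section)
        if src is None:
            continue  # a CALS table need not have a header
        dst = ET.SubElement(html, section)
        for row in src.findall("row"):
            tr = ET.SubElement(dst, "tr")
            for entry in row.findall("entry"):
                # Only plain text is carried over; nested markup is dropped.
                ET.SubElement(tr, cell_tag).text = entry.text
    return ET.tostring(html, encoding="unicode")

html_table = cals_to_html(CALS)
print(html_table)
```

Everything the sketch leaves out (spanned cells, column specifications, footers, rich cell content) is where the remaining table structures, and most of the conversion labor, tend to live.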

Another challenging area for XML archivists and data conversion engineers is the tricky XML encoding requirements engendered by mathematical and chemical formulas and scientific modes of expression. "We may have enough tools to work with," says Gross, "but there are a few areas that still need work. Math is not particularly well done in XML. Once you get into math, and built-up math encoding and rendering capabilities, there is a MathML specification which defines this. But it's not particularly well-tested yet and there are few tools in availability that work with it particularly well."

Advanced mathematics formulas can be viewed as more comparable to font art than to standard typography (we're not talking high-school algebra, here!). Whereas virtually all Western composition exists in the form of characters or letters (Roman, Greek, or other) juxtaposed in an adjacent manner to form words and lines or rows of text, mathematics and chemical symbols are abstract, widely overlapping, often far taller or wider than any typical font character or standard text line height, and seemingly placed at random on a two-dimensional plane in an arbitrary location on the page. The superscripts and subscripts of chemical formulas in simple notation are challenging enough, but as one progresses through compound benzene and toluene rings to more complex organics, and on to symbol placements in a helical DNA snippet or three-dimensional diagrams of protein structures (reduced to two dimensions, of course), the results seem closer to Modern Art than typography and are more often than not renderable only as TIFFs or other pixelated depictions. The amorphous, two-dimensional nature of most complex formulas puts XML (or any!) encoding beyond the realm of scanners and most OCR devices. Many complex and technical chemical diagrams are indeed beyond rendition by software logic into computerized diagrammatic vectors.

A chemical industry standard, CML, has been created to represent most of the simpler, standardized ways of representing chemical formulas. This seems to be reliable in rendering and satisfactory to most users. However, as mentioned earlier, an arcane and monstrously complex competing schema and formatting facility for mathematics has been developed; specifically, the cited but evolving standard of MathML. Relative to CML, MathML is quite tricky in practice, contains many entities, elements, and parameters which must be explicitly set, and often only hand tweaking and testing of formulas encoded in MathML and its templates will yield satisfactory results in print and upon display. In both, there is no fully-automated conversion, and manual encoding is a highly-skilled art of formula-rendering data analysts and practitioners.
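
A sense of why hand encoding is laborious can be had from even a trivial formula. The sketch below builds presentation markup for the fraction (a + b)/2 using the standard MathML element names mfrac, mrow, mi, mo, and mn; the helper function and its token-list argument are illustrative inventions, not part of any MathML toolkit.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

def fraction(numerator_tokens, denominator):
    """Build presentation MathML for (numerator) / denominator.

    numerator_tokens is a list of (tag, text) pairs, e.g. ("mi", "a");
    this calling convention is an assumption made for the sketch.
    """
    math = ET.Element("math", {"xmlns": MATHML_NS})
    mfrac = ET.SubElement(math, "mfrac")
    mrow = ET.SubElement(mfrac, "mrow")  # groups the numerator's tokens
    for tag, text in numerator_tokens:
        ET.SubElement(mrow, tag).text = text
    ET.SubElement(mfrac, "mn").text = denominator
    return ET.tostring(math, encoding="unicode")

markup = fraction([("mi", "a"), ("mo", "+"), ("mi", "b")], "2")
print(markup)
```

Seven elements are needed for a formula a student would write in five keystrokes, and every identifier, operator, and number must be classified explicitly. Built-up expressions with limits, radicals, and matrices multiply that burden accordingly.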

Most ordinary browsers, of course, do not support something of this complexity. Special browsers designed for use in scientific environments are expensive, but available. "The standards groups can put together the definition of standards and get them approved, but then they don't come alive until private organizations start building tools to work with them. So in order for math and chemical equations to be used with XML easily, you have to have tools that allow an author to easily create math on a screen and render the math properly on the way out [to the XML archive], and those capabilities are not yet well developed," Gross said. Depending on the complexity of the formulas, hand editing at the display screen more often becomes the rule rather than the exception.

Compared to XML, HTML is a tiny vocabulary of presentation-oriented display-and-formatting tags. As anyone who has constructed a Web page will attest, these tags often behave very differently in the different proprietary browsers in common use today. HTML content is very difficult to scale, integrate, index, and mix with other publication formats, while XML benefits from user-defined tag sets and separate syntax and semantics (i.e., via a layered architecture), and scales and integrates well. XML can also support high-quality content presentation directly through XSL style sheets, or indirectly through HTML/XHTML, PDF, etc., conversion.

As new browsers are written, XML will replace HTML and its successors in most new Web sites, though the overwhelming existing base amounts of legacy HTML will not be replaced, due to the significant 're-casting' labor cost involved. XML is also not economic for simple display-only or write-once pages and display of unstructured data. XML will replace HTML where users have needs for indexing and retrieval, effective security, more control over formatting, and more complex data requirements (e.g., semiconductor, airline, and multi-source publishing industries).

A common challenge, even in Library Science, is the handling of inclusions of non-text media and explicit references to remote inclusions as well as references to text passages in different physical (e.g., paper or microfiche frame) volumes or pages. Some volumes may not (or may never) exist in electronic form at the time a reference is encoded. Often the concept of ("See") Page 368 has no meaning for streaming XML content. Where footnotes fall depends not only on the characteristics of the output media, but on bibliographic organization and options as well (e.g., at end-of-page, chapter, section, or as an appendix?). In text or content, one may encounter, "Refer to Figure Above" - but, in a given instance of output rendering, the figure may fall below, to one side, into a different medium, or be absent altogether! Automated cross-referencing helps, but can also humorously link the reference "Refer to Figure 15.5" to a line in content such as "and so we figure 15.5 inches is about as long as a Rhinoceros' proboscis ever gets."
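
The "figure 15.5 inches" mishap is easy to reproduce. The following sketch contrasts a naive, case-insensitive pattern with a stricter one that insists on a capitalized "Figure" introduced by a referring phrase. The stricter rule is merely one plausible heuristic, assumed here for illustration; it would miss legitimate references phrased differently, which is part of why human review remains necessary.

```python
import re

TEXT = (
    "Refer to Figure 15.5 for the layout. "
    "And so we figure 15.5 inches is about as long as it ever gets."
)

# Naive: any case-insensitive "figure <number>", the kind of rule that
# produces the false link described above.
NAIVE = re.compile(r"figure\s+(\d+(?:\.\d+)?)", re.IGNORECASE)

# Stricter heuristic (an assumption, not a standard): capitalized "Figure"
# preceded by a referring phrase such as "Refer to" or "See".
STRICT = re.compile(r"\b(?:Refer to|See)\s+Figure\s+(\d+(?:\.\d+)?)")

naive_hits = NAIVE.findall(TEXT)
strict_hits = STRICT.findall(TEXT)

print("naive: ", naive_hits)
print("strict:", strict_hits)
```

The naive pattern reports two references where the author intended one; the stricter pattern finds only the genuine cross-reference in this sample.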

XPointer is a derivative XML facility that enables content creators and archivists to accurately set up extensible and portable links and cross-references. Since all possible dissemination and output forms of an XML document cannot be predicted and tested, data analysts and content developers are not always able to determine all of the links that will be needed or how the pointers will be resolved in a certain rendering of output. Under XPointer, when an XML-archived document is rendered for output, queries against a separate database maintaining cross-reference and anchor target locations are executed and the addresses of the references obtained. Then the referenced text passages or objects (diagrams, Figures, or pictures as may be the case) are retrieved and incorporated in the final rendered (i.e., printed or displayed) output (e.g., a PDF document for downloading and delivery to a requestor's PC via a Web site on the Internet). As always, and as exemplified above, the actual addresses of anchors (objects of the references or pointers) are subject to change, even after the content of an XML archive is created. Automated or manual re-resolution of anchor addresses may be required in every publication instance, depending on circumstances and complexity of the document.
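
The resolve-at-render-time idea can be illustrated with a much simpler stand-in for a full pointer facility: a lookup of reference targets against the anchors actually present in the document. The sketch below is not XPointer syntax; the xref element and the target and id attribute names are hypothetical, and the "separate database" is reduced to an in-memory dictionary.

```python
import xml.etree.ElementTree as ET

# Invented document: one reference resolves, the other points at nothing.
DOC = """\
<doc>
  <p>See <xref target="fig-15-5"/> and <xref target="tbl-2"/>.</p>
  <figure id="fig-15-5"><caption>Rendering pipeline</caption></figure>
</doc>
"""

def resolve_references(xml_text):
    """Split reference targets into those with a matching anchor and those without."""
    root = ET.fromstring(xml_text)
    # Anchor table: every element carrying an id, keyed by that id.
    anchors = {el.get("id"): el for el in root.iter() if el.get("id")}
    resolved, dangling = [], []
    for ref in root.iter("xref"):
        target = ref.get("target")
        (resolved if target in anchors else dangling).append(target)
    return resolved, dangling

resolved, dangling = resolve_references(DOC)
print("resolved:", resolved)
print("dangling:", dangling)
```

Running a check of this kind at each publication instance, rather than once at encoding time, is what catches anchors that have moved or vanished since the archive was created.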

Similarly, XLink is a facility that allows hyperlinking of demarked text or multimedia objects within XML content. This allows displays of XML content or Web pages to hyperlink in a manner nearly identical to that of HTML Web pages. But again, the problem for both XPointer and XLink is exemplified by one that most of us encounter weekly as we go to view a Web site or a listing of sites a search engine (e.g., Yahoo or Alta Vista) has retrieved: we click on a hyperlink that says, "Jo Blo's Hottest & Latest Info" and get Error 404 - Not Found (or, increasingly often, "Server Does Not Have a DNS Entry or No Longer Exists"). Voila! A clear example of an anchor or target which has moved (i.e., been renamed) or has ceased to exist, making the execution of the automated reference pointer problematic for both the reader and for the rendered output destined for print or display. Another common example of the time-value failure of pre-resolved pointers in the HTML/Web world ('dead links') is the tiny, annoying red 'x' evident where once a picture was displayed. Perhaps XML designers and content creators ought to take a tip from frequent Web search engine users who, when encountering useful content and despite the promise of the 'Net', do not add the Web site or URL directory to their browser's 'favorites' list, but instead immediately download and store or print the content out - knowing full well that either the content or the URL won't be there tomorrow!

Information Engineering, Data Analysis, Data Modeling, and XML Vocabulary Development

Indexing and controlling subject vocabulary is a critical skill in the development of a flexible XML archival storage base successfully providing the six fundamental XML benefits. Natural language is messy; the better the writer, the more variation in the vocabulary, and the more data definition discipline is required. Texts often fail to state the obvious, such as their subjects. Subject indexing and cataloguing can be structured to give instant access to large, but precisely targeted, collections of materials (e.g., within multiple libraries). Structured thesauri can be employed to find more specific, or less specific, subject matter. Indexes can grant access to medium-sized collections (e.g., books, periodicals).

An XML vocabulary is defined as an XML element set plus data models (schemas, DTDs, metadata stores, etc.) designed, acquired and designated for a pre-selected range of functions, purposes, and requirements and a range of anticipated users. The problem domain that an XML vocabulary is employed to solve covers specific business and technical areas for one or more vertical industries or markets. The final vocabulary includes industry- or function-specific components such as CML, MathML, and vector graphics for functional areas that involve those sciences.

As with computer codes, giving tag preparation staff a carefully-designed (via functional subject-matter experts) set of options rather than fill-in-the-blanks forms, and requiring them to use the lexicon and subject matter grammar to encode limited choices, substantially increases precision. Use of a subject-indexing vocabulary at the finest grain affordable, while identifying the thesaurus or indexing vocabulary used (or well-designed key wordset, if no public or in-house vocabulary is available) is most likely to minimize problems down the road. Examples of a controlled 'public' vocabulary using subject codes include the Dewey Decimal System, the Library of Congress Classification System, the General Accounting Office Thesaurus, the Medical Abstracts Header List, concomitant vocabularies supporting the various industry XML schema standards, etc. Indexing and vocabulary development are intellectually-intensive efforts, and therefore costly, up front. While automated indexing is better than none, it's nowhere near as useful, accurate, and precise (not to mention less wasteful in operation!) as expert human indexing.
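
The "limited choices rather than fill-in-the-blanks" principle amounts to validating every proposed index term against a closed list. A minimal sketch follows; the four-term vocabulary and the normalization rule (lower-casing and collapsing whitespace) are assumptions made for illustration, standing in for a real thesaurus and its entry rules.

```python
# Hypothetical controlled vocabulary; a real one would come from a thesaurus
# or an in-house subject authority file.
CONTROLLED_TERMS = {"data conversion", "metadata", "table models", "archiving"}

def check_index_terms(terms, vocabulary=CONTROLLED_TERMS):
    """Normalize each proposed term and sort it into accepted or rejected."""
    accepted, rejected = [], []
    for term in terms:
        # Normalization assumed here: lower-case, collapse runs of whitespace.
        normalized = " ".join(term.lower().split())
        (accepted if normalized in vocabulary else rejected).append(normalized)
    return accepted, rejected

accepted, rejected = check_index_terms(["Metadata", "Data  Conversion", "meta-data"])
print("accepted:", accepted)
print("rejected:", rejected)
```

The rejected variant "meta-data" is exactly the sort of free-text drift that a pick-list prevents: a tagger offered only the approved spelling cannot introduce a second one.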

For example, data analysis expertise is often provided by the client to work with DCLab XML professionals. The first stage of data analysis is to collect appropriate data for modeling. This uses information engineering principles to determine what is relevant in the data, what portion is useful for the stated system or project goals and objectives, what can be identified in the data, what logical and structural framework or 'scaffolding' is required to support the end result, and what constraints may affect the final architecture and design. Determining what's relevant in the data can include identifying key information, information-rich portions of the data mass, major subdivisions thereof, and content that has or can have many purposes, as well as which portions are unimportant. Usefulness, of course, exists solely in the perspective of the client - but the process of eliciting, extracting, and identifying it may require ongoing dialogue and skilled observation, as many clients are unable to elucidate their full understandings of the interrelationships of the data and enumerate all of the data or aspects thereof that are considered 'useful'.

XML Data Conversion Project Analysis and Engineering

Designing a framework for archival XML encoding consists of preparing a basic infrastructure (i.e., record structure for transaction-based XML messages or a hierarchical structure for traditional documents, architectural DTDs, logic or data relationship trees, etc.), extra 'containers' to support rendering and dissemination testing, project monitoring facilities for revision control/permissions/status tracking, and metadata concerning the project and vocabulary development itself.

Step-by-step information analysis includes performing: requirements determination, scope determination, document analysis, identification of documents or components and relationships among them, performance of user surveys or analyses, and development of the required conversion specifications. Requirements determination and scope-setting are the most critical activities. These drive the decisions for architecture, analysis, and design. They determine granularity of data and vocabulary, data and document relationships, and tradeoffs and conflicts between design and implementation. They additionally draw the line as to what won't be included in the initial project, but indicate new directions and capabilities that are desirable given future opportunity, additions to the client's mission requirements, or additional funding. Lastly, success for any project involves determination of organizational requirements. These can be ascertained by addressing such questions as who are the project stakeholders, who will benefit most, who holds the purse strings, who authorizes the work, who monitors progress, who stands to lose under various outcomes, and whether there is support within the organization from the bottom up or from the top down or both.

Once the fundamental goals, objectives, requirements, and organizational environment are nailed down, specific planning for detail-level design and prototype run evaluation can be completed. What can be tagged versus what should be tagged is determined, and hand-tagged samples are developed. Software engineering (and COTS and tools/utilities selection and hardware support) is implemented and a sample conversion is accomplished. An analysis of costs is made based on the sample run, and the economics of the larger-scale effort are estimated. Software is completed and a 'Hot List' created. Using this information, the site is staffed and prepared for production.

Gross emphasizes the importance of conducting a prototype trial run with hand-tagged samples. "When we're preparing materials for customers, very often it's the first time that any volume of material has gone through their DTD or through their style sheets. The first part of the conversion process is really testing whether what they've built actually stands up to the real materials that they'll be working with. A lot of times when the analysis gets done up front, the analysts are going through at most a few hundred pages or something like that, whereas the collections might be thousands, hundreds of thousands, or millions." In the FOSE presentation, he noted that even a five-thousand-page project needing a five-minute-per-page fix of a systemic production problem would cost 25,000 cumulative minutes, or 417 man-hours. On the basis of a seven-hour day, this amounts to about 60 working days, or roughly twelve weeks of labor on the part of a temporary employee hired solely for that purpose - an expensive and delay-producing mistake. Therefore, preparation and up-front testing and feedback are the key. Gross quotes Abraham Lincoln: "If I had eight hours to chop down a big tree, I'd spend the first six sharpening my ax."
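
The arithmetic behind Gross's example is worth working through, since the same calculation scales to any collection size. The sketch below simply restates the figures given above (5,000 pages, five minutes per page, a seven-hour day); the function name and its parameters are illustrative.

```python
def rework_cost(pages, minutes_per_page, hours_per_day=7.0):
    """Return (total minutes, total hours, working days) for a per-page fix."""
    total_minutes = pages * minutes_per_page
    total_hours = total_minutes / 60
    working_days = total_hours / hours_per_day
    return total_minutes, total_hours, working_days

minutes, hours, days = rework_cost(5000, 5)
print(minutes, round(hours), round(days, 1))  # prints: 25000 417 59.5
```

Substituting a collection of 500,000 pages gives 100 times the figures, which is the point of the prototype run: a systemic problem caught in a few hundred hand-tagged pages costs hours rather than staff-years.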

Data Conversion Production and Quality Assurance

In the DCLab approach, document tracking systems and production process controls are instituted, and process improvement feedback is reviewed during the ongoing production process. Handling and exception reporting are employed as standard aspects of production management, along with attention to packaging and delivery. Quality assurance is implemented via Initial Review and Final Review, with process improvement feedback monitored and employed all along the workflow process.

By virtue of experience and reputation, DCLab was selected for the world's first large-scale conversion (General Motors), and for the world's largest conversion project (the U.S. Library of Congress). Are XML and SGML the Silver Bullets? "Not so fast ..." Gross says in his FOSE presentation. "Not all XML is created equal. There are differing uses for data." The 'round pegs in square holes' analogy may be appropriate. Everyone uses data differently; new XML work provides an opportunity to design your data; there are new uses for data that can spring from encoding in XML; and these decisions are further subject to competitive differentiation. While XML is not a print format, is the ultimate XML standard format possible?

Implementation issues (as opposed to dissemination issues) include whether DTDs are necessary (for less complex content), and if so, should they be proprietary or Industry Standard? For new material, should you author in XML, or convert afterwards? And lastly, databases — how do you keep track of all the XML files? (cataloging? indexing? etc.) An XML property to be considered is that, even using a simple descriptive data structure, data can be first stored in a giant XML archive or server farm, then easily and inexpensively cataloged and indexed a 'hundred ways from Sunday' at any point in the future.

Gross believes multimedia data will soon become an integral part of XML implementations. Data archiving and communications are now supported by multimedia technologies from a variety of vendors and sources. "Sure, absolutely! Images are being incorporated — a lot of what we do is incorporating images and drawings and now, sounds. It's all opening up." says Gross.

XML tags make it possible to ignore the complexity and proprietary reality of how multimedia data is represented and stored in repositories and databases; e.g., either Microsoft or unix-based operating systems (or Mac OSX, or O/S) can carry the load. "Luckily, most of that is not what we have to deal with. It's all in the software product. Most of our customers spend more time actually worrying about how to link it together once they get it. Our job is to insert [multimedia] reference tags in the right places in the XML code. And we can do our linkages any way anybody wants. So that we can meet whatever specifications are necessary."

XML Futures

For those enterprises and publishers brave enough to contemplate authoring new material in XML, XML likely will change the way each employee works. Who does what work will be altered, often driving each labor category one notch up on the technical skills level (and thus payroll costs). The line between content edit and copy edit will blur. Proofing and checking changes include changes in level (less concern about transposed words, and more about missing structures), changes in specifics (don't have the software or MS Word search option look for a comma, look for ''), new possibilities open up, such as false color proofs (e.g., similar to the way Web editing software separately identifies tags from content using differing colored text displays), and lastly, one error may show up many times, but still be just one error (a systemic error replicated by software or a defective data structure).

New content XML creation is a true paradigm shift - working in data is not working in presentation appearance. Format is controlled via controlling tags and attributes in coordination - each one cannot be different. Forms, templates, and outlines prove useful. Procedures and workflow in a newly-XML-authoring organization will change. Job responsibilities may change, and quite definitely, structured writing may be a new way to think for most authors. Machines will perform some of the work formerly done by authors, and authors will be doing some work never considered part of their job description before.

Thus, organizational change must be considered as the investment required for an enterprise to gain new powers, capabilities, and competitiveness in the publishing (etc.) industry. Changes will be necessary in staffing and jobs, but staffing levels will probably not be reduced — but not increased, either. Instead, the mix of talents will likely change. Training eager and technologically open-and-supportive staff will be key.

Similarly, proofing and re-keying cycles will likely be significantly shortened or eliminated — with the exception of math and chemistry formulas. XML authoring changes the nature of the grunt work so that far more time is devoted to content than to format. Transformation and dissemination of content will become a major activity — more post-processing work will be done, and the potential for the dissemination of more final published products, even to brand-new channels or on new distribution means and media, will be realized. As dramatic shifts in the mix of channels and media in the publishing and multimedia business occur over the next decade, those enterprises employing XML authoring will be, by virtue of being electronic and the lowest-cost producer, in a far better position to swiftly respond — and survive — and thrive!

While all such principles and phenomena related to archiving, data capture, data conversion, and data interpretation may at first seem to be more in the realm of the DBA, the content programmer, or the data analyst's details and not related to the application of XML, they illustrate the immutable need for coordination of contexts and frames-of-references among XML system architects, XML content creators or message senders, XML output end users and receivers, and ultimate XML archives, publication reference sources, and other data stores. Clearly, the stuff inserted between XML tags cannot just be regarded at its prima facie value — simple strings of text-and-number digits. Vocabulary, metadata, and structure are just the beginning of establishing context. Information is what needs to be transmitted and disseminated — and that is far more contextual and challenging than most marketeers of XML software and systems currently acknowledge. That's why Mark Gross and DCLab will likely continue to be in demand and courted by the Nation's largest enterprises for XML work — well into the future.

 
  Copyright ©  Software Technology Magazine. All rights reserved.