Software Technology Magazine
Real-World XML: Beyond the Hype
— STM Staff
From Software Technology Magazine: http://www.softechmag.com

When they say, "The Devil's in the details," nothing is more truly spoken about a good XML implementation. Even a new and breezy technology like XML is subject to the need for rigorous application of the principles of software and information engineering and to the requirement for cogent systems of data structure and value definition, conversion, validation, interpretation, and application.

XML is less about the Web or the Internet and more about a major paradigm shift in the transport and representation of data independent of technological context or mode of transmission. To illustrate: apart from cost relative to shipping computer tapes in bulk, there is no reason why mainframes in the late 1950s could not have been wired to telex terminals to transmit XML data content messages, printed on telegrams at the destination and read into receiving mainframes via the huge OCR machines of the time.

XML has opened the process of developing inter-system and inter-enterprise data communications to a vast reservoir of non-programming, non-technical personnel. This is an important capability, because XML data is most effectively applied when used by workers skilled in an organization's functional (read: data-knowledgeable), rather than technical, areas.

In the bygone age of mainframes, it was considered enough to have a consultant who could be mistaken for Hemingway write a routine in Btrieve or another arcane file utility that produced a text-only flat file, which would then be read into a disparate system using a Rube-Goldberg-like contraption of IBM 3270 or RJE terminal emulators, etc.
Often this required injecting into the "computer room" an unwelcome PC hosting an arbitrating dBase - an adaptation with a cumbersome and gawky presence that mainframe personnel never missed an opportunity to trip over. In this dawn of the age of inter-platform data communication, the process of data extraction and upload was painful, occurred in spurts or batches, and had to be continuously supervised by techs and data validation personnel during operation. Anguish and expense ensued when a data format changed after a key consultant had passed into history.

Along came alternative data processing capabilities, with powerhouses like Oracle and Microsoft SQL Server, Unix and PC servers, 100 Mbps Ethernet and fat Internet pipes, and vertical functional software suites like SAP, PeopleSoft, etc., leading the way. A different world emerged. Faster processing and communication totally changed the economics of data exchange. In the new age of immediacy and real-time processing, it now made sense to coordinate, link, and synchronize disparate and distributed databases using continuous streams of contextual messaging rather than by batch delivery of large but arbitrary blocks of raw data.

XML filled the gap by offering an easy way to standardize data packaging while allowing data constructs and values to be fashioned by non-technical personnel. XML facilitated the expansion of computer-managed objects and media beyond raw character data to advanced media data formats such as sounds, pictures, diagrams (vector graphics), and even motion pictures and video. As a consequence, XML has been touted as a major foundation block of e-commerce and e-communication since the Internet and the Web came into being.
While it is true that XML broke the collective logjam of mainframe mindsets as far as 'inter-species' large enterprise computer communications are concerned, XML does not in any way ameliorate the daily drudgery and sheer horror of data capture, data entry, data validation, and data interpretation. To the contrary, the superficial hype currently propelling XML into corporate data centers is likely to wantonly enthuse and confuse the great bulk of non-technical corporate executives and government bureaucrats, while leading vendors, marketeers, and e-commerce proponents to wholeheartedly buy into the very mystical, mythological, and nebulous panaceas they are proposing.

Each day, a thousand tiny software startups and hopefuls bang on the Corporate Data Center doors to be one of the chosen few to survive and thrive in the stark new 2001 world of pragmatic realism, hyper-competition, and "Show Me the Money." They often present products that are either not thought through (as in a stand-alone concept untested in full integration or applied to an untested market) or insufficient in operation and execution. In more than a few cases, their first customers constituted the infamous 'gamma test', if not actually a continuation of product development.

The incredible buy-ins by venture capitalists and Fortune 500 companies in the late 1990s to the new e-business technologies demonstrate the enterprise's willingness to contemplate emerging technologies as useful business tools. But recent retrenchments in both stock prices and e-business technology purchases demonstrate a new awareness that quite a lot of high-tech-hype chaff must be separated from the wheat: cost-effective and efficient implementation of new technologies that have a real and sustained impact on market share and on profitability.

What caused this explosion?
Every master tech or swashbuckling executive desired to leap into the tech boom to be the one to lead a new startup and its technology into the next millennium - to bring in the Great White Whale in the grand tradition of Steve Jobs or Bill Gates. There was no shortage of boosters for the Star Trek view of the e-technologies future. New programmatic techniques and new ideas (along with marginal improvements in some very old technologies) spawned awesome phalanxes of tiny companies with impressively clichéd names, plus gobs of nascent trade magazines to boost them. As speculative bubbles go, a good time was had by all. However, any pragmatism and reason that got in the way often got tossed over the side on the path to glory and Valhalla.

In decades past, taking risk in the business management of technology was the exception rather than the rule. In fact, risk aversion is regarded as part of the very hallmark of a successful corporate executive. It was only during this period - this narrow window which saw a blitzkrieg of new publications trumpeting magical solutions to knotty computer support and consumer marketing problems - that enterprise managers appeared to throw caution to the winds. But the 'batteries not included' footnote to most of the offered e-business panaceas turned out to be the required flotilla of high-tech staffies or the dump trucks of cash needed to rent them. All too often, when an 'out of the box' solution was proffered, it turned out that what they meant was you also needed to build and keep a box factory and a paper plant nearby and handy.

Such is the case with XML in environments that require extensive interface or customization. At a recent E-Gov seminar where XML (and 'multi-lingual' server farms and portals, et al.) were as usual touted as the answer to most common information system implementation problems, one of the presenters characterized XML as "a giant jobs program for geeks."
Managers must always struggle with the near-rhetorical question, "Beyond the bottom line, were the increased corporate capabilities worth the investment in time, labor, and technology?" All too often they find they must also deal with the double whammy of cost overruns and mission creep.

As a mechanism for communicating data between disparate systems, though, XML has no peer. Since XML's native state is readable text rather than binary, as EDI (Electronic Data Interchange) is, and since XML is separate from computer languages, mediums, and platforms as well as the drudgery of systems and software debugging, non-programming data analysts - the functional wizards who can benefit the most from good data transfer - are able to develop and construct the foundations of data transfer systems and operate them with far less technical support.

XML can be productively used in degrees of simplicity stretching from free-form fashion (simple tag-delineated text with no supporting declaration of data structure) to formally engineered schema (simple-to-complex hierarchical data structures, known as "record layouts" to mainframe techs of the past). The latter may be expressed in a Document Type Definition (DTD) or an XML Schema. Templates and canned systems such as Microsoft's menu-driven BizTalk Server make it all the easier for non-techs to participate meaningfully in the development of extensive enterprise-wide systems. In the case of utilities such as BizTalk, simple screen diagrams aid analysts in rapidly converting all sorts of "data record formats" to XML data structure schemas and vice versa.

Perhaps one of the marvelous things the speculative e-technology bubble of the last few years accomplished was to compel many normally antagonistic competitors in the e-technology business to come together in W3C (and similar organizations) to agree on the various standards embodied in XML and all its cousins.
Unquestionably, Europe, Asia, and the rest of the world now take it seriously. To date, only EDI and XML are in the running as cost-effective, high-level inter- and intra-enterprise and e-business data communications solutions.

Linking Web sites and extracting and exchanging data among Web pages or among Web and enterprise application servers is fairly easy using XML, in contrast with EDI, which requires programming and adherence to some rather arcane rules of hierarchical cataloging. In this vein, XML and EDI can be said to sit at the two extreme ends of the data presentation layer spectrum: EDI at one end, rigorously 'canned' and limited but high in security for the same reason, and XML, with its incredible openness, at the other. (Some EDI backers joke that when your XML software tools arrive, all you get out of the box is two brackets: a '>' and a '<'. What you do with them is up to you!)

Because EDI is tokenized and transmitted in binary form, it is far more compact in data transmission. Robust XML messaging among middleware servers or over low-grade Internet connections, on the other hand, can gobble up bandwidth to an astonishing degree, reducing overall enterprise systems performance and running up general communications costs. Yet participating in EDI can also add the not inconsiderable expense of a private EDI value-added network (VAN), though this also enhances security over most forms of public network such as the Internet. And since EDI usually operates through the mailbox store-and-forward architecture of most traditional VANs, it sometimes requires up to 24 hours per transmission. Consequently, XML inter-enterprise transmissions are much more likely to result in business being conducted on a near real-time basis. Where customers assess the responsiveness of their suppliers, this difference can significantly enhance competitiveness.
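The bandwidth trade-off described above can be made concrete with a rough sketch. It packs one order both as a positional, delimiter-separated record (a stand-in for a compact EDI-style message, not a real EDI format) and as tagged XML, then compares the byte counts. The field names, values, and the '*' delimiter are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical order fields, invented for this illustration.
ORDER = {
    "po_number": "4500012345",
    "supplier": "ACME",
    "item": "WIDGET-7",
    "quantity": "250",
    "unit_price": "3.75",
}


def to_delimited(order):
    """Pack values positionally; their meaning lives in a layout agreed in advance."""
    return "*".join(order.values()).encode("ascii")


def to_xml(order):
    """Wrap every value in a named element; the labels travel with the data."""
    root = ET.Element("order")
    for name, value in order.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root)  # bytes, default us-ascii encoding


def overhead_ratio(order):
    """How many times larger the XML form is than the delimited form."""
    return len(to_xml(order)) / len(to_delimited(order))


if __name__ == "__main__":
    print(len(to_delimited(ORDER)), "bytes as a delimited record")
    print(len(to_xml(ORDER)), "bytes as XML")
    print(round(overhead_ratio(ORDER), 1), "times larger")
```

The extra bytes are the element names themselves. They are the price of a message that a person or a program can read without the positional layout in hand.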
But both EDI and XML must adapt to the information architecture and data definitional needs of the enterprise. For example, one of the reasons EDI is facing declining usage is its limited range of pre-defined data structures. Quite often EDI users are required to use the 'Miscellaneous' fields in a data construct when critical but custom data is required by the enterprise - only to run out of 'Misc.' fields within the prescribed EDI data structure before the design project can be completed. In a flip version of the same problem, XML is not at present well suited for developing data structures on the fly capable of communicating with external enterprises, even within the same industry, since there are as yet no full sets of data constructs and data definitions (for payables, invoices, shipping data, inventory position and financial statements, etc.) universally agreed upon within the various vertical industries (e.g., foods, automotive, health, energy, etc.). These industry recommendations for standards are evolving and are expected to mature over the next few years. In June of 2001, W3C announced a new set of business-facilitating data structure standards, and more are expected down the road. Internally to the enterprise, XML is likely to become a major pillar of archiving and data warehousing (DW), owing to its flexibility in linking data flows among diverse platforms, and by extension, to knowledge management (KM) systems. Yet there are physical limitations in the centralized, 'Grand Plan' DW concept and architecture that pose problems for XML as well. For example, consider a traditional large corporation or enterprise peppered with typical functionally-organized legacy systems and other 'islands of automation'. To glue all of these together to form one central 'corporate memory' can be a major challenge - just getting the disparate platforms to exchange data is nowhere near the end of the challenge. 
Just as terabytes of digital '0's and '1's are meaningless until converted to readable numbers and text, raw data values are useless - or worse, misleading - if not interpreted in the context that was intended when the data was created and stored in the original 'islands of automation'. Enterprise executives have encountered similar contextual and interpretation problems in the application of XML to Data Warehousing conversion and support. Since the originators of various 'iotas' (smallest coherent, interpretable pieces) of data within an enterprise have a conceptual model and purpose different from those of others who intend to use these iotas for unrelated or diverse applications, the founders of the DW concept and architectures discovered that they had to, at a minimum, recast or repackage data as it was drawn from the central DW store, to a degree depending on the receiving organization and its designated application.

How does DW become entangled in the destiny of XML? Precisely because only XML can store data (other than just pure numbers) in a form that can be retrieved for delivery to all devices and destinations in a variety of different formats.

Information cannot come from data without context. Just storing the raw data is not enough. Working out what was intended by the creators, publishers, or 'savers' of various data values cannot be accomplished by just retrieving the data from a central repository; by then it is too late. Ancient texts prior to the Rosetta Stone meant nothing until archaeologists discovered their context in Greek. DW engineers discovered that all of the definitional groundwork and training in context interpretation for the users who (or servers which) will be at each end of a permanent XML data comm link must be set up and executed before the data starts to go into the central store.
Ironing out many of the metadata and XML data structure and schema issues for a completely centralized system turns out to be costly, laborious, time consuming, and - most of all - political.

Schema Challenges

Developing pragmatic and successful schema for existing data, particularly data of an unusual nature, such as archived data stretching back a century or more, is problematic for all information engineers, not just for XML schema architects. Compounding the complexity of XML conversion efforts has been the need to adapt to the wiles and ways of human beings: the ultimate end users and assessors of all project success.

In the early 1990s, the management of the Armed Forces Institute of Pathology (AFIP), located at the Walter Reed Army Hospital base near Silver Spring, Maryland, commissioned a contractor to analyze and update its Pathology Information Management System (PIMS). The current system was deemed marginally adequate for most current activity, though data quality control had slipped to a level considered unacceptable, and research on related prior conditions required multiple manual and online source reviews of old internal and external data. Often, the system took five to ten minutes to locate and display a current record.

As usual, the contractors knew nothing about the functional area or the politics or history behind PIMS and the organization. The doctors and career pathologists were very independent, and each had his or her particular way of organizing, retrieving, and interpreting information. The legacy data not yet incorporated into PIMS stretched back to the Civil War, often including hand-scribbled notations produced on the fields of battle (one example: a scrap of uniform cloth with the patient's name and date and the scrawled words, "laig blowd of" [sic - yes, one 'f']).
Other data sources included everything from old lab reports scrawled on 3x5 cards and lined paper to microfiche to x-rays to punched tapes to 8-inch floppies to Univac and IBM RJE and mainframe source tapes in formats long ago discontinued. Worse than the challenge of widely disparate media, much of the terminology in pathology had changed over the years: not only had computers and library archiving methods crossed several major thresholds of change, but so had medicine. Definitions for similar items and phenomena were often irreconcilable, disputed amongst the pathologists, and unrenderable in computer terms. Indeed, the most powerful resident pathologists could not agree on the details and range of definitions represented by their own professional associations so that the diagnosis code numbers could be 'parenthetically' included in the records of the new PIMS.

It is into this fray that the contractors plunged. Battles royal raged long into the night, with professional disagreements among the pathologists on not only the definition of data, but the schema - how it was to be organized! While watching the theoretical arguments whiz by, the contractors were astounded to discover that the way the data was organized also affected its meaning - and ultimately, the diagnosis.

By the end of the project, as frequently happens with organizations dominated by exacting professionals steeped in the functional wisdom of their fields, the pathologists elected to dump the effort to design a comprehensive custom-tailored system and set about searching for vendors who could automate the critical portions of the Institute's needs on an a-la-carte basis. To justify rejection of the contractor's prototype system (in the design of which they had participated), the pathologists on the evaluation team cited only the following, in paraphrase: The contractors committed to a maximum retrieval time of 35 seconds for the largest records including hi-res X-rays under heavy load conditions.
In the demo, this took 45 seconds. The proposed system therefore fails to meet the new PIMS requirements.

Data Challenges

The architectural schema issues that dogged the upgrade of the PIMS system into a holistic, comprehensive data retrieval system still face the XML industry groups and enterprise XML information engineers and data analysts alike. The real world is rife with different points of view, and thus different ways of organizing and looking at data. Hence, wherever there is a different country, culture, enterprise, or computer system, there will be a need to change and arbitrate data formats if two disparate sources are to exchange information with each other.

XML has been recognized within the e-business industry as having the power to accomplish this. Spin-offs of XML capability, including XSL and XSLT (and in a parallel and related way, DSSSL), have been developed to do just that. Data format changes within XSL systems permit the arbitration of data among different schema. Centralized enterprise data, for example, can be repackaged, interpreted, and customized between internal organizations or for the outside world, depending on end requester needs and perspectives. However, good schema design is complicated by varying needs for different levels of granularity and by different interpretations of even some of the most fundamental iotas of meaning.

Obsolete Data is No Data At All

Some promoters and managers of DW truly intend the system to save every scrap of text or info that the enterprise generates (or captures from the outside world) on the remote chance that it may turn out to be a key find years down the road. Unfortunately, data and information are far more perishable than that. Like milk and other perishables, a piece of data needs a freshness date! All data has a time value and a time-based, contextual meaning. Rosie Abramawitz becomes Mrs. R. McGillicuddy.
At this point, the Rosie data must be split into two separate but linked values: the old data value, which accurately informs us about history, and the new one, which will allow us to locate her today.

A key consideration for DW practitioners: should data which will soon become erroneous, obsolete, or unusable be saved? This is not to say that one should change history, but rather that history often no longer has relevance, and old data that was true then now has altered meaning and is therefore confusing at best. For large enterprises, a key risk of globally centralized, near-real-time access storehouses is that just one piece of bad or obsolete data of the right kind can have an enterprise-wide damaging impact.

Far more overlooked and damaging than erroneous data (assuming effective data edit systems), data made erroneous through time obsolescence can wreak pure havoc in an enterprise's systems. Data units for the distance to Mars on a space mission are changed by government regulations to reflect kilometers rather than miles. Active taxpayers, contributors, voters, or statistical study subjects become deceased. Two divisions are consolidated into one, and one set of records is zeroed out or reflects random, semaphoric, or special-flag values, or the final run's data, from that point forward. The customer received a newer model as a warranty replacement unit, but no one can decide how to update the database. Did any data fields or records slip by unconverted? Oops.

Diagram: data code --> data text value --> data definition --> data interpretation --> data meaning --> information --> understanding

Think about the data element 'color' and a corresponding data value, 'green'. Green can be all greens (ranging from yellow with a greenish tint to cyan nearest deep blue). But is that all greens? Depends on who you ask. And what it is you're going to observe or paint. Etc. For example, green can be one green, say, teal. Teal is a blue-green.
But exactly which blue-green, wavelength- and spectrum-wise? To human eyes, even the narrowest definition of 'teal,' a precise spectral photonic wavelength, can be exactly duplicated by mixing one wavelength from the yellow-green spectrum-sensitivity range of retinal cones plus one from the blue-sensitive cones. Once again, it depends on who you ask, and in what context this information iota is to be applied.

In the distant past, in order to limit confusion (or, from the other side of the coin, to improve accuracy, precision, and conciseness), mainframe information engineers institutionalized computer codes to serve as universal symbolic references that could only represent a very specific, narrow interpretation that almost all users could clearly comprehend without effort or training (perhaps the most successful of which, the State Codes, e.g., NY, CA, MD, and VA, arose from mainframers' abbreviations and were adopted by the USPS). While others required manual table lookups, these symbolic contrivances had the distinct advantage, beyond storage efficiency, of being difficult to misinterpret.

Unfortunately, most data values and natural language schema element descriptors (e.g., in XML) are very parochial and particular to the locale or office where they originate (nay, not even interpretable across the enterprise). Consequently, both codes and text values, at the originating or computer end, and naming conventions and conceptual identification via office training, at the human end, are necessary to ensure that an iota of information coming from a database or data element is correctly interpreted and properly applied in the course of operations. Without this kind of pragmatic but holistic conceptual (interpretation-education and information-program) development, for example, higher mathematics could not exist.

Contextual Errors

Data structures, be they in databases or in XML DTDs or schemas, provide part, but only part, of a data value's context.
Even the simplest data of all - an empty data structure containing no data - is subject to considerable interpretation. Take the number zero or the text value, blank. Let's say the schema tells us this data element represents sales. Does a number zero mean that there were zero sales made by the Specialty Division for that week? Or just that the data element hadn't been populated with that data yet? If the number zero was entered intentionally, does it mean that the sales were actually zero, or that the Specialty Division failed to report sales to the Data Center that week? Or does it mean that sales figures for the Specialty Division are now consolidated with the Widget Division and this is the first week that the Specialty Division data elements in this database will forever carry zeros? The list goes on. Blank text fields beget similar problems.

In a fairly blatant example of conflicts in interpretation, major HMOs have been known over recent years to use the 'diagnosis' field in their Federal forms databases to indicate a suspected malingerer or a telephonically difficult-to-manage customer (actual sample quote, as spelled: "total rack-ass"). This tactical misuse of a defined data structure has led to quite some confusion when a physician retrieves a patient's medical history - and one can only speculate how this practice has impacted ongoing statistical medical research studies! Even hard, factual data can be, by virtue of its very juxtaposition and context, scientific, political, legal, motivational, operational, or some combination.

Siemens attempted to standardize communications and information worldwide among all of its computer systems in 160 countries using test prototypes based on XML. The system - known as ShareNet - was to support operations and serve as a corporate-wide knowledge base (KM) to aid its worldwide sales forces.
To the surprise of Siemens management, intractable differences in language and culture made XML's natural language data representation hugely laborious and difficult to interpret and integrate (e.g., German to English to the various native languages), so Siemens ultimately scrapped the system in favor of more deterministic Oracle-to-Web-page tools from ArsDigita Corp. It also didn't help that bandwidth and processing power in a majority of those countries, many of them tiny (using mostly 286s, 386s, and World War II vintage phone lines), were minuscule and frequently led to intolerable performance hits. One can imagine the algorithmic complexity of a middleware server transmitting data globally to corresponding servers while having to translate data content in up to 160 languages, screen for illicit words or other cultural faux pas, and dynamically recast data structures and schemas, all at the same time. In these environments, XML's openness can be as much a challenge as a benefit.

Given these challenges in data capture and conversion, vendors of XML systems may be more successful in the long run by pitching them as archival or data comm development tools backed up by sustained vendor support rather than as corporate-wide implementable COTS plug-in solutions for viable international business.

Like many other computer-based processes, the promise of XML tends to fall apart at the person-machine interface. Data entry errors create no shortage of horror stories: in one Virginia law enforcement database, some detainees were on record as committing "a salt and buttry", while others were mailed tickets for "failure to get a town bus" (business license - 25-character field limitation).
An audit of a Massachusetts State database turned up over fifteen different misspellings of the word "Boston," showing the department in question why it had so few matches with other State forms databases despite an initial run of several hundred thousand records.

Organizations proposing XML data definition standards for inter-enterprise communication within specific industry groups are encountering the same problems EDI encountered in its infancy. Even the most fundamental data elements such as 'Company Name' and 'Address' defy deterministic definition. Data value interpretation is affected as well. "No" can refer to "North" or "Number", or even the opposite of "Yes." "St." could stand for "Saint," (business) "Stop," or "Street." Trivial inclusion or omission of punctuation such as periods, dashes, extra spaces, or commas often causes match failures or user confusion. Names of entities or persons can be abbreviated or listed with nicknames, preceding qualifiers, trailing qualifiers, or middle initials. Lack of qualifiers (e.g., a title indicating gender or status, or numerical addresses of shopping malls and industrial parks without a sub-office number) causes match failures or false positives and complicates the use of XML to communicate meaning between two disparate servers or enterprises. Legacy template data indicating unfilled fields, e.g., 99/99/99 or 999.9, can cause a false match for the year 1999 or a large value for sales.

Owing to contextual differences in the interpretation of data among the functionally and geographically separate parts of large enterprises, the founders of the data warehousing concept were eventually compelled to recognize the difficulty of maintaining and managing one central store for the corporate archive or DW function, and stepped back a bit to the concept of Data Marts. Data Marts consisted of multiple key storage sites divided more often on functional than geographical lines.
They were empowered to disseminate their locally generated information to approved enterprise end users while arbitrating its meaning. This permitted major stovepipe systems and their supporting organizations to broker and barter information on the contextual and interpretive basis that they desired, and enabled the development of data links that ensured that data quality, data definition, and data value interpretation were (in their view) properly maintained. In practice, though, the underlying driving forces for selecting the Data Mart model and information-broker architectures have often been assessed as being more political than pragmatic.

As a number of GAO audits have revealed, the IRS might be a telling case of the un-implementability of the central-store archive or DW concept: although the legacy core functions and systems are identical, operations are divided on a geographical basis by Processing Centers, and there is no one central 'Master File' system storing information on all taxpayers. This poses endless problems and difficulties when taxpayers move or when it is desired to run enterprise-wide statistical analyses to determine new trends in the overall U.S. taxpayer base.

Vocabulary and Definition Errors

What would each of two companies define as an order? A P.O.? Delivery? Acknowledgement of receipt? Ultimate payment? Or non-return within the grace period? While this may sound like a data value interpretation problem, there are substantial implications for XML schema. The instinctive approach is to create multiple parallel definitions in an industry XML standard. Yet in the great majority of a company's data, only one definition is used, and that definition varies widely from company to company rather than from instance to instance. And as XML schema get more complex, templates, entities, and tag populations can proliferate uncontrollably.
A constellation of corporations looking over W3C's shoulders and pitching for their pet additions and modifications didn't seem to help either. This seems to be why, until very recently and years after work commenced, the W3C and United Nations-backed ebXML schema standards had still not been officially finalized. Commerce XMLs and many vertical industry XML standards efforts seem to be suffering a similar fate. Despite the clearly enunciated objectives of the panels' efforts, none can seem to answer the question, "How do we know when we're done?" The continuing saga adds fuel for those who make money playing Proprietary Games: in an apparently repeating phenomenon, eWeek, in an article in May, along with other sources, reported that during the year immediately following new standards releases, the same companies sitting on the committees for these increasingly complex standards experienced a brisk upturn in sales of the very software tools necessary to deal with the added complexity. Conflict of interest? Or community service? Hmm...

XML is more easily scalable horizontally (i.e., outward to other functions, organizations, and entities), but it's still no picnic, as many tech companies on the rocks are finding out. Often one hears from XML boosters the assertion that "any application that can handle XML can get all the information it needs to understand a message from within the message itself." In fact, this is only possible if there is contextual understanding between the sender and receiver (or the storer and retriever, as the case may be). A number of XML conversion failures in the realm of legal contracting have underscored the point. Returning to the example of the simple inter-corporate sales transaction using XML messaging, the need for updating joint and mutual contextual understanding becomes apparent when a business partner changes the definition of a data element.
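To make the point concrete, here is a minimal sketch in Python of a receiver checking an incoming message against the vocabulary it has agreed on with its partner. The element names, the AGREED_ELEMENTS set, and the unknown_elements helper are all hypothetical, invented for illustration. The message parses without complaint, yet one of its elements carries no agreed meaning for the receiver.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary the receiver has agreed on with its partner.
AGREED_ELEMENTS = {"order", "poNumber", "quantity", "shipDate"}


def unknown_elements(xml_text, agreed=AGREED_ELEMENTS):
    """Return the tag names in a message that the receiver has no definition for."""
    root = ET.fromstring(xml_text)
    seen = {elem.tag for elem in root.iter()}
    return sorted(seen - agreed)


# A well-formed message in which the partner has quietly added a new element.
message = """
<order>
  <poNumber>A-1001</poNumber>
  <quantity>12</quantity>
  <graceDays>30</graceDays>
</order>
"""

print(unknown_elements(message))  # prints ['graceDays']
```

A check like this can only report that something unfamiliar has arrived. Deciding what the new element means still takes a conversation between the two parties.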
If you've made prior arrangements with a supplier, say, you can use SOAP (Simple Object Access Protocol), which Microsoft helped originate, to perform an ad hoc remote procedure call which will return an XML schema defining the data structure required by the supplier for inquiring about, say, the status of a delivery order. Then an XML transaction query can request specific data in the new schema's format. The combination of the two queries is called an 'XML query', and the neat part is that the query operation will succeed and deliver data even if the supplier totally changed the XML schema and data record structure immediately prior and without notice. What's not so neat is when the XML schema and data are returned with a brand new data element field described as "fram-winding wizzlebats" - and since neither man nor machine can make heads or tails of such descriptors, blind XML queries often still require a further, if quite ordinary and low-tech, email or phone call.

Perhaps the largest impact that XML will have is in organizing Web sites by separating content from presentation and inexpensively linking dynamic Web pages to a single, comprehensive Web site database. The home shopping channel QVC completed a Web site based on a catalog database sporting more than 20,000 items, and the system interacts well with most of the major search engines (Yahoo, Google, etc.), so casual Internet browsers are easily able to run across QVC's Web offerings (and no one can argue with the success of Amazon or eBay!).

Some vertical industry groups that have formed have already produced XML schema standards for their industry, such as the Chemical Industry Data Exchange's cML. In some cases like cML, pressure for expediting standards development comes from the need to find cost-effective means of filing for and meeting Federal and State government regulations.
In the case of the Chemical Industry, heavy environmental and safety reporting requirements and the need for labor-saving automation of forms reporting made swift development of cML an obvious choice. However, analysts in most industry groups say that XML standards for their industries have not yet matured to the point that they can be widely adopted, though hi-tech industries such as electronics, computers, and specialty manufacturing and information-based industries such as insurance and finance are further along in this endeavor, with their many specialized vertical-industry standards.

Early reviewers of some of the vertical-industry-specific standards for inter-enterprise data communication using XML have underscored the risk of going down the same path as EDI and becoming too complex. Some say the new standards are so extensive and detailed that they may be eschewed by most of the companies for which they were designed in favor of more flexible, lightweight versions that will emerge - much as happened with HTML over SGML and LDAP over DAP. It was not until HTML and LDAP, much simpler "subset" versions of the venerable but little-used SGML and DAP standards, were developed and universally agreed upon that Web sites and global e-businesses were able to take off.

A pragmatic strategy for those contemplating participation in their vertical industry group's XML recommendations is to remain flexible. Use of these standards to communicate 'blind' on an ad hoc, as-needed basis with new customers, partners, and suppliers does not mean the standards must be rigorously enforced for intra-enterprise use. Further, using multiple schema or subsets makes operational sense as a way to get in the game immediately and end the nuisance of placing routine orders and exchanging computer data via fax.
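One way to picture keeping an in-house vocabulary while still speaking an industry standard at the boundary is a simple tag-translation step, sketched below in Python. The tag names, the INTERNAL_TO_INDUSTRY table, and the to_industry helper are hypothetical and stand in for whatever a real industry schema would specify. Real schema differences usually involve structure and meaning as well as names, so this is only the simplest case.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from in-house tag names to an industry schema's names.
INTERNAL_TO_INDUSTRY = {
    "ord": "PurchaseOrder",
    "cust": "BuyerName",
    "qty": "Quantity",
}


def to_industry(xml_text, mapping=INTERNAL_TO_INDUSTRY):
    """Rename every mapped tag; tags with no mapping pass through unchanged."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


internal = "<ord><cust>Acme Corp</cust><qty>12</qty></ord>"
print(to_industry(internal))
# prints <PurchaseOrder><BuyerName>Acme Corp</BuyerName><Quantity>12</Quantity></PurchaseOrder>
```

Keeping the mapping in a table rather than in code means a functional specialist, not a programmer, can maintain it as the industry standard evolves.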
An emerging service from some Internet business brokers and portals is 'info-mediation', where infomediaries (special servers) perform translation among multiple dissimilar schema in the same or different industries before delivering XML transactions to disparate recipients.

EMPHASIS BOX

By carrying its own structural definition with it, XML uniquely provides the ability for automated systems to transmit useful data even on an ad hoc, unanticipated basis. One might envision a conversation stemming from such a transmission arriving at an unsuspecting company: "Hey, Joe! Someone just shipped us a box of data. Beats me why we got it or what it's for. Bounce it to Corporate to take a look." - and sure enough, the data would turn out to be useful, important, and in the corporate database in minutes - all without programming, and without pencils, thermal fax ribbons, or Xerox toner ever touching paper. Since it's in English (at least for XML streams in places such as Australia, Britain, Canada, India, Scandinavia, South Africa, and the U.S.), a human arbiter or e-postman can serve as a data comm "switch" and redirect it to the proper entry-point or office location where it's needed - something a random sender couldn't possibly know. The advantages over blindly transmitting a "To Whom It May Concern" email with an attached Access database are obvious. While this is clearly not a recommended use of XML, it illustrates both the usefulness of any XML transmission/retrieval and its dependence on contextual preparation and pre-acceptance at both ends of the XML stream. Schema and data definitions alone are not enough.

Data Transmission and Messaging

Inter-server and inter-process communications software using XML as a messaging or data communications facility can be divided into three classifications: Application Services Exchange (ASX), Distributed-Server Middleware (DSM), and Legacy-Data Middleware (LDM).
Application Services Exchange (ASX) processes receive incoming requests and data from external organizations and entities (usually either through a firewall or via a VPN or private network) and convert, package, and provide secure delivery of enterprise data on the return. These exchanges can include Web sites or other sites external to the enterprise, and can carry a wide range of inter-system status and control messaging as well. Where status or customer data cannot be maintained on the requestor's server, front-end XML databases fill the gap.

The second form of XML processing, Distributed-Server Middleware (DSM), is intra-enterprise server-to-server middleware for transaction messaging among disparate platforms. Frequently, multiple small XML databases are used for 'caching' the data and pending status of transactions whose responses from servers down the line have not yet been received.

Lastly, Legacy-Data Middleware (LDM), an important form of middleware for accessing and translating mainframe applications data, is similar in concept to DSM but is identified separately owing to the complex nature of legacy platform access. Various proprietary hardware and communications equipment boundaries must be crossed, and even now data return is often complicated by the need for inelegant methods such as polling, caching, and translating data among multiple disparate in-line pieces of communications hardware, as well as ad hoc systems for periodic batch capture. Suspense databases and hold-transaction-until-complete database caches are achieved via LDM using XML on the communications interface and non-legacy ends.

In the operation of the three XML-bridged activities, XML has proven to be not only a great data-transmission and conversion tool, but also a handy facility for handling data caching during data transmission over distributed network systems.
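The hold-transaction-until-complete idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular middleware product: the element names (pending, txn), the status value, and the helper functions new_cache, hold, and complete are all hypothetical, and a real cache would also be written to disk and guarded against concurrent access.

```python
import xml.etree.ElementTree as ET


def new_cache():
    """Create an empty in-memory XML cache of pending transactions."""
    return ET.Element("pending")


def hold(cache, txn_id, payload):
    """Record a transaction that is still waiting on a downstream reply."""
    txn = ET.SubElement(cache, "txn", id=txn_id, status="awaiting-reply")
    txn.text = payload
    return txn


def complete(cache, txn_id):
    """Remove a held transaction and return its payload, or None if not found."""
    for txn in cache.findall("txn"):
        if txn.get("id") == txn_id:
            cache.remove(txn)
            return txn.text
    return None


cache = new_cache()
hold(cache, "T1", "order 12 widgets")
hold(cache, "T2", "order 3 gears")

released = complete(cache, "T1")  # the downstream server has answered for T1
print(released)
print(ET.tostring(cache, encoding="unicode"))  # T2 is still held
```

Because the cache is itself an XML document, it can be inspected, forwarded, or archived with the same tools used for the messages it holds.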
Web professionals and e-commerce middleware creators discovered in the late 1990s that XML can be used to create small- to medium-sized databases in record time, thus skirting the expensive and time-consuming effort required to mount Oracle, SQL Server, and other full DBMSs for small jobs or on temporary servers. The databases can contain supporting information, temporary caches of data, repositories of messaging components and data, schema conversion and other lookup tables, etc. This is where the opportunity for software producers and the market for "up-and-running" SOHO COTS packages for XML, particularly in Desktop Publishing, are about to explode.

Larry Ellison's 'One Database Fits All' and 'middleware is unnecessary and ought to be banned' speech at the E-Gov 2001 convention aside ("Ve KNOW Vhat dee Pipple Vahnt!", to paraphrase the resultant joke of a now-retired IBM-er), XML messaging increasingly appears to be taking on the mantle of the international standard for inter-site and inter-computer communication. Since innovation and non-standard computing solutions will always well up from the grassroots of computer users and businesses, XML communications and data repository solutions to middleware, data access, and other tower-of-Babel problems will be, for the foreseeable future, very much in demand. The viability and success of the results, however, will depend on the competency of each XML implementation - and this clearly requires a return to the tried-and-true practice of rigorously applying the principles of software and information engineering.

Copyright (c) 2003 Software Technology Magazine. All rights reserved.