Inside Open XML
Open XML is not just a new file format for the latest version of Microsoft Office, but an open standard capable of expressing any Word, Excel or PowerPoint document.
Originally published on DNJ Online, Jun 2007
The 2007 Microsoft Office system brings many changes but the most significant, as far as organisations of any size will be concerned, is one that few end-users will ever see. This is the new Office Open XML file format used by Word, Excel and PowerPoint 2007 to store documents. For Office 2007, Microsoft has adopted a native file format that, while able to support all the features of Office documents back to Office 2000, is also an open standard capable of being read, manipulated and extended by third parties.
The Office Open XML standard was approved by Ecma International (previously the European Computer Manufacturers Association) in December 2006 and represents the collaborative effort of some 20 members including vendors such as Apple, Intel and Novell, and organisations such as BP, Barclays Capital, The British Library and the US Library of Congress. It has now been submitted to the ISO/IEC Joint Technical Committee for ratification. The full specification runs to over 6,000 pages but a useful overview can also be found at www.ecma-international.org/publications/standards/Ecma-376.htm.
Open Packaging Convention
Central to Open XML is the Open Packaging Convention (OPC). This is defined in Part 2 of the specification but is also used by Microsoft’s XML Paper Specification (XPS), originally codenamed ‘Metro’. OPC describes a method for storing a set of files within a single compressed file known as a ‘package’.
This is done using ZIP technology, as originally created by Phil Katz of PKWARE. Indeed, if you take a .docx file saved from Word 2007 and change the file extension to .zip, you can see what is inside using either the WinZip utility or the ZIP support that is now built into Windows itself. Shown here are the contents of a fairly simple document containing a couple of paragraphs and one embedded image.
As you can see, the package contains a number of parts grouped into various folders. Most of these contain XML data structured in accordance with published schemas that describe various aspects of the document, although a part can also be a binary data stream such as the embedded picture image1.jpeg shown in our example. One useful side-benefit of the format is that original-resolution versions of embedded images are stored as distinct parts with their original file extension within the package (usually within a ‘media’ sub-folder), from where they can easily be extracted.
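As a rough illustration of that side-benefit, the sketch below uses Python's standard zipfile module to pull every part stored in a ‘media’ folder out of a package. The function name read_media and the in-memory stand-in package are assumptions made for the example; with a real document you would pass the path to the .docx file instead.

```python
import io
import zipfile
from pathlib import PurePosixPath


def read_media(package):
    """Return {file name: bytes} for every part stored in a 'media' folder."""
    media = {}
    with zipfile.ZipFile(package) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            # Keep only entries whose immediate parent folder is 'media'
            if path.parent.name == "media" and not name.endswith("/"):
                media[path.name] = zf.read(name)
    return media


# Stand-in package built in memory; a real .docx path works the same way
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("word/document.xml", "<placeholder/>")
    zf.writestr("word/media/image1.jpeg", b"\xff\xd8 not a real image")
demo.seek(0)

images = read_media(demo)
for filename, data in images.items():
    print(filename, len(data), "bytes")
```

Because the images are ordinary ZIP entries, no knowledge of WordprocessingML is needed to retrieve them.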
Other parts contain metadata that define content type and the relationships between the parts. The content types used by every part within the package are defined in the file [Content_Types].xml, which must be present in the root of every package. The only other reserved location is /_rels/.rels which specifies the relationship between the package itself and the top-level parts.
To understand this better, let’s look at what is required to create a basic ‘Hello world’ document. At the very least the package must contain a content-type part, a package-relationship part and a document part. As you can see, these are all standard XML documents with defined namespaces. In the content-type part and the package relationship part these reference schemas that are specific to OPC. For the document part the namespace references the WordprocessingML schema, introduced with Office 2003 and now part of the Open XML specification:
The content-type part /[Content_Types].xml:
The package relationship part /_rels/.rels:
The document part /document.xml:
Example taken from the ‘Explanatory Report on Office Open XML Standard (ECMA-376)’.
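To make the three parts named above concrete, here is a minimal sketch that assembles such a ‘Hello world’ package with Python's zipfile module. The XML strings are reconstructed from the description in this article and the published Open XML namespaces rather than copied from the report, and the function name write_hello_world is an assumption made for the example, so treat it as an approximation.

```python
import io
import zipfile

CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
    ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

PACKAGE_RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
    Target="document.xml"/>
</Relationships>"""

DOCUMENT = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Hello, world.</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def write_hello_world(target):
    """Write the three required parts into a ZIP package (path or file object)."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)  # content-type part
        zf.writestr("_rels/.rels", PACKAGE_RELS)           # package relationships
        zf.writestr("document.xml", DOCUMENT)              # main document part


buffer = io.BytesIO()
write_hello_world(buffer)
with zipfile.ZipFile(buffer) as zf:
    part_names = zf.namelist()
print(part_names)
```

Passing a file name such as hello.docx instead of the in-memory buffer should produce a package that a consuming application can open.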
The purpose of the content-type part is to tell consuming applications such as Word 2007 how to interpret the various types of content that the package contains. In this case the part defines just two content types. The first is for files with the ‘rels’ extension which define relationships; the second is for files with the ‘xml’ extension which in this case are to be interpreted as WordprocessingML documents.
In more complex packages a Default element may simply map the ‘xml’ extension to the ContentType ‘application/xml’. This would be followed by a number of more specific Override elements, each identifying a single part through its PartName attribute and specifying, for example, that the file /word/styles.xml should be interpreted using the WordprocessingML style specification.
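The lookup this implies can be sketched as follows: an exact part-name override takes precedence, and the extension default is the fallback. The sample XML and the function name content_type_of are assumptions made for illustration, not taken from a real package.

```python
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

SAMPLE_TYPES = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
    ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/styles.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
</Types>"""


def content_type_of(part_name, types_xml):
    """Resolve a part's content type: Override first, then Default by extension."""
    root = ET.fromstring(types_xml)
    for override in root.findall(CT_NS + "Override"):
        if override.get("PartName", "").lower() == part_name.lower():
            return override.get("ContentType")
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(CT_NS + "Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None  # no declaration covers this part


styles_type = content_type_of("/word/styles.xml", SAMPLE_TYPES)
generic_type = content_type_of("/customXml/item1.xml", SAMPLE_TYPES)
print(styles_type)
print(generic_type)
```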
By convention, parts are arranged within specific folders inside the package. Most Word 2007 packages store the file document.xml inside a /word folder, for example, and embedded images inside the sub-folder /word/media. However, as our ‘Hello world’ document demonstrates, this needn’t be the case as here document.xml is to be found in the root of the package. Indeed the only folder that must be present is the /_rels folder that holds the top-level relationship part ‘.rels’.
So applications cannot rely on the internal file structure of the package to discover the location of the various parts. Instead they should use the relationship parts for this purpose, starting with /_rels/.rels. In our ‘Hello world’ example this defines just one relationship, identified as ‘rId1’, which tells us that document.xml is in the root directory and is the target of an ‘officeDocument’ relationship, marking it as the main document part.
For ‘Hello world’ that is the end of the story: we have found the one and only part that defines the content of the package, and we know that the content is stored in WordprocessingML. In more complex packages, such as that shown in Figure 1, we can see that the /word folder also contains a /_rels folder which contains another relationship part, namely document.xml.rels. This contains further relationships that target the remaining parts of the package. The relationship identified as ‘rId6’, for example, targets the embedded image /media/image1.jpeg.
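The discovery process just described can be sketched in a few lines of Python. The helper names (rels_part_for, relationships, rels_xml) and the stand-in package are assumptions made for the example; the sketch starts at /_rels/.rels, finds the main document, and then reads that part's own relationship part.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def rels_part_for(part):
    """Name of the relationship part for a source part ('' is the package)."""
    folder, name = posixpath.split(part)
    return posixpath.join(folder, "_rels", name + ".rels")


def relationships(zf, part=""):
    """Return (id, type, target) tuples for one source part."""
    name = rels_part_for(part)
    if name not in zf.namelist():
        return []
    found = []
    for rel in ET.fromstring(zf.read(name)).iter(f"{{{REL_NS}}}Relationship"):
        target = rel.get("Target")
        if rel.get("TargetMode", "Internal") == "Internal":
            # Internal targets are relative to the folder of the source part
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(part), target)).lstrip("/")
        found.append((rel.get("Id"), rel.get("Type"), target))
    return found


def rels_xml(*entries):
    """Build a small relationship part for the demo package."""
    rows = "".join(
        f'<Relationship Id="{i}" Type="{DOC_REL}/{kind}" Target="{target}"/>'
        for i, kind, target in entries)
    return f'<Relationships xmlns="{REL_NS}">{rows}</Relationships>'


demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("_rels/.rels",
                rels_xml(("rId1", "officeDocument", "word/document.xml")))
    zf.writestr("word/_rels/document.xml.rels",
                rels_xml(("rId6", "image", "media/image1.jpeg")))
    zf.writestr("word/document.xml", "<placeholder/>")
    zf.writestr("word/media/image1.jpeg", b"")

with zipfile.ZipFile(demo) as zf:
    main = next(target for _, kind, target in relationships(zf)
                if kind.endswith("/officeDocument"))
    children = relationships(zf, main)
print(main)
print(children)
```

Note that nothing here depends on the folder names: the location of every part comes from a relationship.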
This may seem a complicated way of doing things but it does offer a number of advantages. For a start it means that consumer applications can discover parts and the relationship between parts without having to interpret application-level schema. It also means that relationships can be established and modified without having to touch the actual content of the document.
For example, the relationship schema also supports a TargetMode attribute, which by default is set to ‘Internal’ to indicate that the target is internal to the package. However it can also be set to ‘External’ where the target is not part of the package. The following, for instance, defines relationship rId9 which targets a particular page on the DNJ Online Web site:
The content of the document, stored within document.xml, might include the following line:
<w:hyperlink r:id="rId9" w:history="1">
Without going into the complexities of WordprocessingML, this specifies a hyperlink whose target is defined by the relationship rId9. In the document it would appear as a link to the specified article. However the target for this link could be changed without making any changes to document.xml, simply by changing the relevant Target attribute within the relationship part.
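A minimal sketch of that idea follows. It rewrites the Target of one relationship and never looks at document.xml. The URLs are hypothetical placeholders (the article's own link is not reproduced here), and the function name retarget is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK = ("http://schemas.openxmlformats.org/officeDocument/2006/"
             "relationships/hyperlink")

# Hypothetical relationship part; the Target URL is a placeholder
RELS = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId9" Type="{HYPERLINK}"
    Target="http://example.com/old-article" TargetMode="External"/>
</Relationships>"""


def retarget(rels_xml, rel_id, new_target):
    """Return the relationship part with one Target attribute replaced."""
    ET.register_namespace("", REL_NS)  # keep the default namespace on output
    root = ET.fromstring(rels_xml)
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Id") == rel_id:
            rel.set("Target", new_target)
            return ET.tostring(root, encoding="unicode")
    raise KeyError(rel_id)


updated = retarget(RELS, "rId9", "http://example.com/new-article")
print(updated)
```

Saving the change means writing the updated part back into the package; Python's zipfile module cannot replace an entry in place, so in practice the archive is copied with the one part swapped.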
The internal structure of the Open XML format makes it possible to change the content of a document without having to parse the main document part itself. By editing the appropriate relationship we can swap Part1.xml content for Part2.xml within the document.
To take another example, the following SpreadsheetML defines a workbook containing a single worksheet:
<sheet name="Sheet1" sheetId="1" r:id="rId1"/>
Which worksheet it contains is determined by the relationship rId1. Again, this can be changed by editing the appropriate relationship, without having to go into the SpreadsheetML itself.
Furthermore, as we shall see in our next article, the Packaging API that comes with .NET includes methods that make the process of creating, walking through and editing relationship parts fairly straightforward.
The Open XML specification defines three primary markup languages which target each of the document editors in Office 2007. These are WordprocessingML, SpreadsheetML and PresentationML which first appeared with Office 2003. Each is based on XML, as you can see from the small snippet exhibited in the document.xml part of our ‘Hello world’ document.
The structure of these languages is fairly straightforward although their definition does take up the bulk of the 6,000 pages of the full Open XML specification. This is because they are intended to support the full feature set of all versions of Word, Excel and PowerPoint going back to Office 2000, enabling documents created with any of these applications to be faithfully captured.
In the case of WordprocessingML, the root is the ‘document’ element which contains a ‘body’ element. This in turn can contain a number of ‘p’ (paragraph) elements which can contain multiple ‘r’ (run) elements. A run is a group of characters that have identical properties and so require no additional markup. Runs would in turn contain ‘t’ (text) elements and there are also ‘rPr’ (run property) and ‘pPr’ (paragraph property) elements for defining the format of these elements, and so forth. Footnotes, headers and footers are stored in separate parts within the package.
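The nesting of body, paragraph, run and text elements can be seen in a short Python sketch that collects the text of each paragraph. The sample markup and the function name paragraphs are assumptions made for illustration; real documents carry many more property elements.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

SAMPLE_DOC = """<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">Hello, </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def paragraphs(document_xml):
    """Return the plain text of each 'p' element, joining its runs."""
    root = ET.fromstring(document_xml)
    result = []
    for p in root.iter(W + "p"):
        # Each run ('r') holds its characters in one or more 't' elements
        result.append("".join(t.text or "" for t in p.iter(W + "t")))
    return result


texts = paragraphs(SAMPLE_DOC)
print(texts)
```

The second run in the first paragraph carries an ‘rPr’ element (bold), which is why it is stored separately from the first even though the two read as one phrase.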
SpreadsheetML starts with the root ‘workbook’ element which contains ‘sheet’ elements pointing to the various sheets in the document. Each sheet is stored in a separate part that contains a ‘worksheet’ element that goes down through the ‘sheetData’ element, the ‘row’ element, the ‘c’ (cell) element and finally to the ‘v’ (value) and ‘f’ (formula) elements. One point to note is that the sheetData element only contains information about cells that are not empty. Other aspects of the specification deal with charting and formatting.
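As a hedged illustration of that hierarchy, the following Python sketch reads the formula and value of each cell from a worksheet part. The sample sheet and the function name cells are assumptions made for the example. Row 2 is absent from the markup because it holds no data.

```python
import xml.etree.ElementTree as ET

S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

SAMPLE_SHEET = """<worksheet
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><v>2</v></c>
      <c r="B1"><v>3</v></c>
    </row>
    <row r="3">
      <c r="A3"><f>A1+B1</f><v>5</v></c>
    </row>
  </sheetData>
</worksheet>"""


def cells(sheet_xml):
    """Map each cell reference to a (formula, value) pair; either may be None."""
    root = ET.fromstring(sheet_xml)
    found = {}
    for c in root.iter(S + "c"):
        found[c.get("r")] = (c.findtext(S + "f"), c.findtext(S + "v"))
    return found


sheet_cells = cells(SAMPLE_SHEET)
print(sheet_cells)
```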
PresentationML defines the root element ‘presentation’ which contains pointers to ‘sldMaster’, ‘notesMaster’, ‘handoutMaster’ and ‘sld’ (slide) parts. These in turn support elements describing the various graphics, text, tables and charts used on the slides themselves. In addition there is DrawingML, a comprehensive markup language for defining vector graphics. There is also support for VML (Vector Markup Language) although this is included solely for backward compatibility. DrawingML is to be preferred wherever possible.
One seemingly trivial but important aspect of these markup languages is the brevity of the element names – often just a single character. This is deliberate as it helps to cut down on the size of Open XML documents. Other space-saving features include the sharedStrings part used by SpreadsheetML to ensure that repeated strings of text are only stored once.
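The shared-strings mechanism can be sketched as below. In SpreadsheetML a cell marked with t="s" stores an index into the sharedStrings part rather than the text itself; that detail comes from the specification rather than this article. The sample data and the function name resolve are assumptions made for illustration, and the sketch ignores rich-text and phonetic runs.

```python
import xml.etree.ElementTree as ET

S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

SHARED_STRINGS = f"""<sst xmlns="{NS}" count="3" uniqueCount="2">
  <si><t>North</t></si>
  <si><t>South</t></si>
</sst>"""

SHEET = f"""<worksheet xmlns="{NS}">
  <sheetData>
    <row r="1">
      <c r="A1" t="s"><v>0</v></c>
      <c r="B1" t="s"><v>1</v></c>
    </row>
    <row r="2">
      <c r="A2" t="s"><v>0</v></c>
      <c r="B2"><v>42</v></c>
    </row>
  </sheetData>
</worksheet>"""


def resolve(sheet_xml, shared_xml):
    """Replace shared-string indexes in a sheet with the text they point to."""
    strings = ["".join(t.text or "" for t in si.iter(S + "t"))
               for si in ET.fromstring(shared_xml).iter(S + "si")]
    values = {}
    for c in ET.fromstring(sheet_xml).iter(S + "c"):
        raw = c.findtext(S + "v")
        # t="s" means the value is an index into the shared-strings table
        values[c.get("r")] = strings[int(raw)] if c.get("t") == "s" else raw
    return values


resolved = resolve(SHEET, SHARED_STRINGS)
print(resolved)
```

Here ‘North’ is used twice in the sheet but stored only once in the package.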
XML Digital Signatures are supported through the Digital Signature Origin part which is referenced through /_rels/.rels and provides a starting point for discovering the signatures contained within a package. Each signature is held in a separate part which has the root element ‘Signature’.
Extending Open XML
It should already be apparent that you can manipulate Open XML documents extensively without the help of Office 2007. The .NET Framework makes it easy to reach inside OPC packages and XML documents and make changes, and as we have seen the use of relationship parts means that many modifications, such as altering the order of slides in a presentation or changing the style of a document, can be made without getting to grips with application-specific markup languages such as WordprocessingML.
What is also apparent is that consumer applications need only concern themselves with the parts within the package that are relevant to them, leaving other parts untouched. An image editor, for example, could reach inside the package shown in Figure 1 and edit image1.jpeg without affecting, or indeed having to know anything about, any of the other parts.
Open XML also supports Custom XML Data Storage parts which can contain XML data conforming to any schema you want. A part could contain data about a contract, for example, such as buyer and seller names and addresses, property details and so forth. Once the appropriate relationships have been set up within the package, this data can be dropped into the document where required and any changes to the custom data will be automatically reflected in the document. Furthermore, documents can contain more than one Custom XML Data Storage part, each tailored to the needs of a different application. This opens up many opportunities for document manipulation and workflow control.
Support for Open XML
The primary purpose of Open XML is to create a file format that is based on industry standards (namely ZIP and XML) and capable of accurately storing the full complexity of Word, Excel and PowerPoint documents since Office 2000. With this in mind, Microsoft has produced the Compatibility Pack.
Once this is installed, users of Office XP or Office 2003 will actually see the new Open XML file formats listed in their Save, Save As and Open dialog boxes. Office 2000 users can load Open XML documents by double-clicking them but have to convert saved files by right-clicking them in Windows Explorer and then selecting the appropriate Open XML file format from the Save As option.
The Compatibility Pack is a free download from the Microsoft Web site and works with Word Viewer 2003 and Excel Viewer 2003 as well. The new PowerPoint Viewer 2007 allows you to view any presentation produced by PowerPoint 97 onward. These are also free but do not allow you to edit or save documents.
Office 2007 will open documents created by Office 97 through to Office 2003 in a ‘compatibility mode’ from where they can be saved to either their original file format or as Open XML. If you want to convert a large number of files automatically then you need the Office File Converter which comes as part of the Office Migration Planning Manager (OMPM). You can download this from the Microsoft TechNet Web site.
It is early days but a number of third parties have already announced their support for Open XML. Corel has said that WordPerfect Office will support Open XML sometime this year, while Novell has announced that its version of the open source application OpenOffice.org will support Open XML, and that it will submit the relevant code back to the OpenOffice.org project.
Novell’s announcement is particularly interesting as the native file format for OpenOffice.org is Open Document Format for Office Applications (usually referred to as ODF or OpenDocument). This is an OASIS standard that was approved as an ISO/IEC standard in May 2006, and is also supported by Sun Microsystems’ StarOffice suite.
ODF, like Open XML, aims to encapsulate documents, spreadsheets and presentations, and does so by combining XML and binary files in a ZIP archive. However ODF content markup is much more like XHTML and does not attempt to capture all the information created by older versions of Microsoft Office. ODF does support custom metadata, but doesn’t have anything analogous to the relationship parts of Open XML.
There is an open source project at http://odf-converter.sourceforge.net that is developing an Open XML/ODF Translator Add-in for Office which allows Microsoft Office XP, 2003 and 2007 applications to open and save ODF files. It also promises to be able to handle batch conversions.
There is much debate, often heated, over the relative merits of Open XML and ODF, but it is clear that the aims of the two specifications are different. Certainly, there is room for multiple standards: witness the JPEG, PNG and CGM image formats, which are all ISO/IEC standards. As we said earlier, Ecma has submitted Open XML for consideration as an ISO/IEC standard, which would put it on an equal footing with ODF.
Politics aside, Open XML opens up many new possibilities for office automation and, as we shall see over the page, is well supported by the .NET Framework.