OpenXML Validation patch

Nov 11, 2013 at 9:41 PM
First, a bit of background.

We have taken an interest in getting our OpenXML Word documents to pass schema validation; in other words, this check:
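    // Needs: System.Linq, DocumentFormat.OpenXml,
    // DocumentFormat.OpenXml.Packaging, DocumentFormat.OpenXml.Validation
    public static void AssertThatOpenXmlDocumentIsValid(WordprocessingDocument wpDoc, string message)
    {
        Check.RequireNotNull(wpDoc);
        // Validate against the Office 2010 flavour of the schema
        var validator = new OpenXmlValidator(FileFormatVersions.Office2010);
        var errors = validator.Validate(wpDoc).ToList();
        // Our own helper: fails the assertion if any errors were collected
        ReturnAssertErrorsForOpenXmlValidation(message, errors);
    }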
OpenXmlValidator is provided in the SDK.
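
ReturnAssertErrorsForOpenXmlValidation is not shown here. As a rough sketch only, a stand-in could format each ValidationErrorInfo and throw if any were found. The class and method names below are hypothetical, and throwing InvalidOperationException is an assumption rather than what a particular test framework would do:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using DocumentFormat.OpenXml.Validation;

public static class ValidationReport
{
    // Hypothetical stand-in for a helper like ReturnAssertErrorsForOpenXmlValidation:
    // does nothing when the list is empty, otherwise throws with one block per error.
    public static void ThrowIfAny(string message, IList<ValidationErrorInfo> errors)
    {
        if (errors.Count == 0)
        {
            return;
        }

        var report = new StringBuilder(message);
        foreach (var error in errors)
        {
            report.AppendLine();
            report.AppendLine("ID: " + error.Id);
            report.AppendLine("Description: " + error.Description);
            report.AppendLine("ErrorType: " + error.ErrorType);
            report.AppendLine("Node.OuterXml: " + error.Node?.OuterXml);
            report.AppendLine("Path.XPath: " + error.Path?.XPath);
            report.AppendLine("Part: " + error.Part?.Uri);
        }

        throw new InvalidOperationException(report.ToString());
    }
}
```

Including the XPath and the part URI in the message makes it much easier to find the offending element in the generated document.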

The current output of HtmlToOpenXML fails this validation for our not-terribly-complicated HTML input, and we'd like it to pass. It's all very nitpicky stuff which the Word application accepts, but which fails schema validation. It revolves around the order of elements. This article covers it in a bit of detail:

http://blogs.msdn.com/b/brian_jones/archive/2009/01/12/open-xml-sdk-the-basics.aspx

--Quote---
There's even more useful functionality in those four lines. Here's the equivalent without using the first class properties, can you spot what's wrong?
RunProperties rPr = new RunProperties(); 
rPr.AppendChild(new Italic()); 
rPr.AppendChild(new Bold()); 
rPr.AppendChild(new NoProof()); 
This snippet actually creates a schema invalid document. The schema specifies the children of the rPr element as a sequence, so order matters. Bold (w:b) must come before Italics (w:i) for the file to be valid according to its schema. The code snippet using the property assignments gets this right (because the code behind those assignments knows about the order, which the second one just obeys the calls).
--Quote---

The key is to set the typed properties instead of just appending children (via AppendChild or the constructor). For example, this:
new StyleRunProperties(
    new Bold(),
    new BoldComplexScript(),
    new DocumentFormat.OpenXml.Wordprocessing.Color() { Val = "4F81BD", ThemeColor = ThemeColorValues.Accent1 },
    new FontSize { Val = "18" },
    new FontSizeComplexScript { Val = "18" }
)
becomes this:
new StyleRunProperties
{
    Bold = new Bold(),
    BoldComplexScript = new BoldComplexScript(),
    Color = new DocumentFormat.OpenXml.Wordprocessing.Color() { Val = "4F81BD", ThemeColor = ThemeColorValues.Accent1 },
    FontSize = new FontSize { Val = "18" },
    FontSizeComplexScript = new FontSizeComplexScript { Val = "18" }
}
I would like to tell you this is a 100% complete fix, but there are undoubtedly problem areas in the HtmlToOpenXml code we haven't found yet. We are hoping to get these initial changes included in the trunk to at least make a start on this, and we pledge to bring you any more such changes as we encounter them. Alternatively, we would be happy to work against a test suite, but I didn't see one in the project file we have, and I don't know if you have anything like it already.

It seems I can't attach files here, so here's a link to the patch:

https://www.dropbox.com/s/1zk7r4jqjvn6yu4/HtmlToOpenXML_unified_validation_patch.diff
Coordinator
Nov 12, 2013 at 9:08 AM
Hello,

Thank you very much for your patch. I will include it soon.
If you have any other bug fixes, do not hesitate to come back to me.

It's nice to know that some people are showing interest in this project again.
Just out of curiosity, why do you want to ensure schema validation?
Nov 12, 2013 at 5:16 PM
Edited Nov 12, 2013 at 5:19 PM
Schema validation at least assures us that the file will load in the Word application. Obviously if it fails validation it may still load, but at least this way we can be certain that if it passes schema validation, it is going to load.

Our files may still have wrong content or presentation, of course, but that's a different class of bugs that's far more difficult to write automated tests for.

Do let us know when the fix is incorporated so we can sync up our code base with yours.
Coordinator
Dec 11, 2013 at 12:05 PM
This discussion has been copied to a work item.
Coordinator
Dec 11, 2013 at 12:20 PM
Hello, I have merged your code into the main branch.
In the end, I reviewed more tags (especially tables and all the styles) and refactored a lot of my code to make it work. I rely a bit on reflection to obtain the sequence order of the style attributes and to know where to insert them. As the reflection call is cached, the performance impact should be small.

Thanks to you, I found 2 existing bugs :-)
Dec 11, 2013 at 12:24 PM

excellent...I'll download the main source code again and update. I'll let you know if anything else pops up.

cc

Coordinator
Dec 11, 2013 at 12:29 PM
Yes, let me know if that fixes your problem on Google Viewer.
Otherwise, tell me how to reproduce your bug and I will test it myself (sorry, I'm not familiar with Google Docs).
Dec 20, 2013 at 12:36 PM

Hi,

Unfortunately, it does not. It does behave differently: I still can't view it within the Google Docs viewer but *can* download it. So it is slightly better, but I should think it still fails the validation.

To test, you just need a publicly accessible URL for a document (generated by the library) and embed that URL in the following one for Google: https://docs.google.com/viewer

The link points to a test page.

regards

Chris Cardinal

Dec 20, 2013 at 5:34 PM
We are really pleased you decided to accept our patch and try to improve validation - again, thanks so much!

We recently started having a new validation error in production for a particular bit of HTML. I have tried to come up with a fix myself, but it is taking me long enough to understand that I thought posting it here might be faster.

We are getting validation errors like this:
ID: Sch_AttributeValueDataTypeDetailed
Description: The attribute 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:val' has invalid value ''. The attribute value cannot be empty.
ErrorType: Schema
Node.OuterXml: <w:tblStyle w:val="" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" />
Path.XPath: /w:document[1]/w:body[1]/w:tbl[2]/w:tblPr[1]/w:tblStyle[1]
Part: /word/document.xml
It's because HtmlToOpenXML is trying to set a null "Val" attribute on a tblStyle element. This happens in a variety of places in the HtmlToOpenXML code; I haven't yet isolated which one is causing our particular problem, but here's a representative one from HtmlConverter.ProcessTag.cs:
                Table currentTable = new Table(
                    new TableProperties {
                        TableStyle = new TableStyle() { Val = htmlStyles.GetStyle("Table Grid", StyleValues.Paragraph) },
                        TableWidth = new TableWidth() { Type = TableWidthUnitValues.Pct, Width = "5000" }, // 100% * 50
                    },
Here htmlStyles.GetStyle is returning null, so the required attribute Val is null, and we fail validation.

If the fix is immediately obvious to you, great. Otherwise let me know if you need more detail, such as the original HTML source - I can build a test case up for you.
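
For what it's worth, here is an untested sketch of the kind of guard I have in mind: only add the tblStyle element when a style id is actually available. The class and method names are hypothetical, and styleId simply stands in for whatever htmlStyles.GetStyle returns:

```csharp
using DocumentFormat.OpenXml.Wordprocessing;

public static class TablePropertiesSketch
{
    // Hypothetical helper, not part of HtmlToOpenXML.
    public static TableProperties Build(string styleId)
    {
        var properties = new TableProperties
        {
            TableWidth = new TableWidth { Type = TableWidthUnitValues.Pct, Width = "5000" } // 100% * 50
        };

        // Only emit w:tblStyle when there is a real style id,
        // since the schema rejects an empty w:val.
        if (!string.IsNullOrEmpty(styleId))
        {
            properties.TableStyle = new TableStyle { Val = styleId };
        }

        return properties;
    }
}
```

Because TableStyle is assigned through the typed property, it should still land ahead of tblW in the element sequence even though it is set second. Whether omitting the style is acceptable, or a fallback style should be used instead, is your call.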