
Another Corrupt Simple Word (docx) File Problem

By: John Lawler
Date: 2012-02-10
Time: 00:27

Another Corrupt Simple Word (docx) File Problem

Hi, it sounds like I'm having a problem identical to this thread:

  http://www.tinybutstrong.com/forum.php?thr=2849

and it's possible the cause of my problem is the same as that poster, only I'm having trouble following exactly what the suggestions were and what problem he says he found.

Skrol, when you say "You can check by checking the very first binary bytes of the file", which file do you mean?  The docx (zip container) file itself or the word\document.xml file?

I did do a quick binary edit of the docx file itself and saw the characters 'PK' at the beginning, so I assume that's fine.
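
The same check can be scripted instead of done in a binary editor. What follows is only a minimal sketch: the function name docx_starts_with_zip_signature is hypothetical (it is not part of TBS or OpenTBS), and it assumes a non-empty archive, which begins with the ZIP local file header signature "PK" followed by the bytes 0x03 0x04. Any stray output sent before the archive, such as a leftover debugging printf, would push that signature away from the start of the file.

```php
<?php
// Hypothetical helper (not part of TBS/OpenTBS): returns true when the file
// begins with the ZIP local file header signature "PK\x03\x04".
function docx_starts_with_zip_signature(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $head = fread($handle, 4); // only the first four bytes matter here
    fclose($handle);
    return $head === "PK\x03\x04";
}

// Self-check with two throwaway files: one clean, one with text in front.
$good = tempnam(sys_get_temp_dir(), 'docx');
$bad  = tempnam(sys_get_temp_dir(), 'docx');
file_put_contents($good, "PK\x03\x04rest-of-archive");
file_put_contents($bad, "debug output\nPK\x03\x04rest-of-archive");

var_dump(docx_starts_with_zip_signature($good));
var_dump(docx_starts_with_zip_signature($bad));

unlink($good);
unlink($bad);
```

A passing check only rules out leading junk; it says nothing about the XML parts inside the archive.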

I originally did have a stray printf in my PHP script for debugging, and I was hoping that was causing the problem, but when I removed that, the problem persists.

My symptoms are the same, I think, but I'll spell them out more explicitly to help others find this (and the other thread).  When I open the generated document in Word 2010, I get "The file test2.docx cannot be opened because there are problems with the contents." I click "OK", and get another prompt: "Word found unreadable content in test2.docx.  Do you want to recover the contents of this document?  If you trust the source of this document, click Yes."

I do click "Yes", and as the other poster indicated, the contents look fine, so it must be some very trivial corruption or problem or Word would probably completely fail to open it.  I've run an XML validator (Tidy) on the word\document.xml file and it appears valid.

Thanks for any help.
By: Skrol29
Date: 2012-02-10
Time: 00:54

Re: Another Corrupt Simple Word (docx) File Problem

> Skrol, when you say "You can check by checking the very first binary bytes of the file", which file do you mean?
>  The docx (zip container) file itself or the word\document.xml file?

The docx.

> I did do a quick binary edit of the docx file itself and saw the characters
> 'PK' at the beginning, so I assume that's fine.

It seems to.

Try with
$TBS->PlugIn(OPENTBS_DEBUG_XML_SHOW);

If that doesn't help, it may be a bad merge in "word\document.xml".
Try deactivating the MergeBlock() calls one by one until the final DOCX is valid.
If it is still not valid, continue by deactivating the automatic fields ([onload], [onshow], [var]) one by one, beginning with those that have the parameter "ope=changepic".
Once you've pinpointed the faulty merge, try to see why the result is invalid for MS Word. I can help you there.


By: John Lawler
Date: 2012-02-10
Time: 19:35

Re: Another Corrupt Simple Word (docx) File Problem

Skrol, firstly, thanks for the response.

Secondly, augh!  I'm pulling my hair out over here, but I think I finally have something conclusive that may let you reproduce the problem I'm having.  Remember I said that my test application for OpenTBS was to produce a multi-page Word document almost completely containing drawing elements (text boxes and lines, mostly).

I've been having trouble from the very beginning with the two warning boxes popping up every time I loaded my generated files, but I've finally concluded with these two observations:

a) If for some reason, I only merge one record, resulting in one page, I do not get those warning pop ups, on my generated files.  That's interesting, but doesn't help me much.

b) More importantly, I was able to cause your demo_ms_word.docx template to exhibit what appears to be the *exact* same behavior if I make this simple change to your template file: on the last page, with the image in it, (in Word 2010, anyway) go to the 'Insert' tab, choose 'Shapes' and choose a Text Box, for example (I think any shape might cause the problem).  Draw a small text box on the last page, next to the image, then re-run your demo application on the newly modified template.

If your test goes the same way as my multiple passes at this have gone, I think you will now see the error / warning.  The test can be made much simpler too, by just isolating it to basically that last page, if you want.  Do you have any idea why Word would be complaining about a simple rectangle being added but not an image (which I assume is represented in the XML in a similar fashion)?

If you can't get this to fail on your side, I can forward you the modified template.

Thanks.
By: Skrol29
Date: 2012-02-11
Time: 01:43

Re: Another Corrupt Simple Word (docx) File Problem

Hi John,

Thanks for those clues.
I can reproduce it, I'm working on it...
By: John Lawler
Date: 2012-02-13
Time: 17:11

Re: Another Corrupt Simple Word (docx) File Problem

Any progress on this issue, Skrol?  If there's anything I can do to assist, please let me know.

You don't think it's possible that the problem could be caused by repeated XML element ids for the individual shapes, do you?  E.g.,
<v:shape id="Text box 7"
being repeated multiple times, once on each page.  I don't know if Word would object to that or not, and also I would think that you'd have the same issue with the repeated image element in the example document that you provide that works, and doesn't issue such warnings.

That was the only thought I had off of the top of my head.  Unfortunately, Word is being quite unhelpful with detail on what exact problem it has with the document.
By: Skrol29
Date: 2012-02-14
Time: 02:31

Re: Another Corrupt Simple Word (docx) File Problem

Hi John,

Found it !!

<wp:docPr id="1" .../>

The <wp:docPr> element is mandatory and defines the ID of the drawing object (the shape).
Its "id" attribute must be numeric and unique across the document.
Unfortunately, it is duplicated by TBS.
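
To confirm that a generated document has this problem, the repeated ids can be listed from the contents of word/document.xml. This is only a rough sketch under stated assumptions: find_duplicate_docpr_ids is a hypothetical helper, not an OpenTBS function, and it uses a simple regular expression rather than a real XML parser, which is enough for a quick diagnosis.

```php
<?php
// Hypothetical helper: lists <wp:docPr> id values that occur more than once
// in the given XML string (for example the contents of word/document.xml).
function find_duplicate_docpr_ids(string $xml): array
{
    // Capture the id attribute of each <wp:docPr ...> start tag. Whitespace is
    // required before "id" so other attributes ending in "id" are not matched.
    preg_match_all('/<wp:docPr\b[^>]*?\sid="([^"]*)"/', $xml, $matches);

    // Count how often each id value appears and keep those seen more than once.
    $counts = array_count_values($matches[1]);
    $repeated = array_filter($counts, function ($n) {
        return $n > 1;
    });

    // Numeric array keys come back as integers; normalize them to strings.
    return array_map('strval', array_keys($repeated));
}

// Made-up sample: the text box appears twice with the same id.
$sample = '<w:body>'
    . '<wp:docPr id="1" name="Text Box 1"/>'
    . '<wp:docPr id="2" name="Picture 2"/>'
    . '<wp:docPr id="1" name="Text Box 1"/>'
    . '</w:body>';

print_r(find_duplicate_docpr_ids($sample));
```

For the made-up sample, the returned list contains only the id "1".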

I will try to find a solution for the next OpenTBS version.

While waiting for the update, you can use this tip, which works if only one shape is duplicated:
Insert this field inside the paragraph where the shape is anchored:
[b.#;att=wp:docPr#id]

If several shapes will be duplicated, you can use a custom function with the "onformat" parameter to generate unique ids:
[b.#;onformat=f_unique_id;att=wp:docPr#id]


Thank you very much for your help on this complicated bug.
By: John Lawler
Date: 2012-02-14
Time: 19:12

Re: Another Corrupt Simple Word (docx) File Problem

Excellent!  I'm glad you found the source of the problem.  Out of curiosity, *how* did you?  Word seemed useless as far as specifying what the issue was.  Were you just working on a hunch or did you have some other tool to help you get to what the real problem was?

Anyway, I was hoping that your work around would work for me quickly, but I've unfortunately been at it for about an hour and have to conclude I'm doing something wrong.  I wrote a simple custom PHP function, as you suggested, like this:

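// Counter shared across calls; 50 is just an arbitrary starting value.
$unique_id = 50;
// "onformat" callback: replaces the field's value with the next id.
function f_unique_id($FieldName, &$CurrVal) {
  global $unique_id;
  $CurrVal = $unique_id++;
}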

Then, in my template, I took the very top paragraph on the page, the one with "Page break before" set on it, and anchored all of my 4 or 5 drawing elements to it.  The only other real element I have on the page is a small table which contains an image (important part of this project); I use the table to vertically and horizontally center the image in a certain area because the exact image size may vary (and is being inserted using ope=changepic, different on each page).

I've tried many different permutations and the only time I've had TBS complete the template okay and have a result to look at, it appears that my f_unique_id output was used to replace the id on only one of the picture elements on the whole page, the other drawing elements aren't touched.  Naturally I thought this must have to do with how I anchored them, but they all show the anchor just to the left of the (one line) paragraph that contains this code:

[b.#;onformat=f_unique_id;att=wp:docPr#id]

I'm getting this frustrating error when running the PHP page:

Notice: Undefined property: clsTbsLocator::$AttForward in D:\tbs\tbs_class.php on line 933

Call Stack:
    0.0004     337248   1. {main}() D:\tbs\tbs_test_docx.php:0
    0.0492    3798736   2. clsTinyButStrong->MergeBlock() D:\tbs\tbs_test_docx.php:59
    0.0492    3799312   3. clsTinyButStrong->meth_Merge_Block() D:\tbs\tbs_class.php:676
    0.0492    3800864   4. clsTinyButStrong->meth_Locator_FindBlockLst() D:\tbs\tbs_class.php:1780
    0.0507    3839056   5. clsTinyButStrong->meth_Locator_SectionNewBDef() D:\tbs\tbs_class.php:1640


Notice: Undefined property: clsTbsLocator::$AttInsLen in D:\tbs\tbs_class.php on line 936

Call Stack:
    0.0004     337248   1. {main}() D:\tbs\tbs_test_docx.php:0
    0.0492    3798736   2. clsTinyButStrong->MergeBlock() D:\tbs\tbs_test_docx.php:59
    0.0492    3799312   3. clsTinyButStrong->meth_Merge_Block() D:\tbs\tbs_class.php:676
    0.0492    3800864   4. clsTinyButStrong->meth_Locator_FindBlockLst() D:\tbs\tbs_class.php:1780
    0.0507    3839056   5. clsTinyButStrong->meth_Locator_SectionNewBDef() D:\tbs\tbs_class.php:1640


Notice: Undefined property: clsTbsLocator::$PrevPosBeg in D:\tbs\tbs_class.php on line 964

Call Stack:
    0.0004     337248   1. {main}() D:\tbs\tbs_test_docx.php:0
    0.0492    3798736   2. clsTinyButStrong->MergeBlock() D:\tbs\tbs_test_docx.php:59
    0.0492    3799312   3. clsTinyButStrong->meth_Merge_Block() D:\tbs\tbs_class.php:676
    0.0492    3800864   4. clsTinyButStrong->meth_Locator_FindBlockLst() D:\tbs\tbs_class.php:1780
    0.0507    3839056   5. clsTinyButStrong->meth_Locator_SectionNewBDef() D:\tbs\tbs_class.php:1640

For some reason, when I first tested this, I did not see the above error, TBS completed, and generated my output document, *but* it still wasn't correct.  As I mentioned above, it looks like TBS replaced the wp:docPr#id on only one element per page, not the 5 or 6 that need to be handled.  But now, it seems like any time I add that onformat/att code to the main paragraph (the one the drawing elements are anchored to), I get the above error.

I'm sorry to continue asking for your time on this issue, but I wonder if you have any guesses about what I'm doing wrong or what else I could try to troubleshoot this.
By: John Lawler
Date: 2012-02-14
Time: 19:34

Re: Another Corrupt Simple Word (docx) File Problem

One quick follow up to the above.  I took a template that was giving me the above "Undefined property" errors and made a single simple change to it: I added one more line (drawing element) to the document, in the middle of the page, saved it, and re-ran, and now the errors don't come out!

The document still isn't right (it has repeated docPr ids, as before), but there're no errors.  I don't know if this is just a fluke or what, but it seems to have to do with the att=wp:docPr#id code, because if I take that out, I'm pretty sure I never see the errors.

So, I don't know if that is a bug, and if so if it's related to the main bug or missing feature of being aware of those docPr id's, but thought I'd mention it.  Anyway, I'm still working away at this trying to get my workaround to happen.
By: Skrol29
Date: 2012-02-14
Time: 22:08

Re: Another Corrupt Simple Word (docx) File Problem

> Out of curiosity, *how* did you?

- simplify the template that produces the error as much as possible,
- extract the erroneous XML content, indent it, and insert it back into the DOCX,
- Word then gives you the line of the error (it was in the <drawing> element),
- let Word fix the document, and compare the fixed XML with the erroneous XML.
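
The "indent it" step in the list above could look something like the sketch below. It is an illustration only: indent_xml is a hypothetical helper built on PHP's DOM extension, the sample XML is made up, and getting word/document.xml out of the DOCX and back in is left to any zip tool. The re-indented copy is meant for diagnosis, not for production use, since dropping whitespace-only nodes may affect runs that contain only spaces.

```php
<?php
// Hypothetical helper: re-serializes an XML string with one element per line,
// so that an error reported by line number points at something specific.
function indent_xml(string $xml): string
{
    $dom = new DOMDocument();
    // Drop whitespace-only nodes so the serializer is free to re-indent.
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    if (!$dom->loadXML($xml)) {
        throw new InvalidArgumentException('Input is not well-formed XML');
    }
    return $dom->saveXML();
}

// Made-up, minimal stand-in for word/document.xml, all on one line.
$oneLine = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    . '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>';

echo indent_xml($oneLine);
```

With every element on its own line, a line number in an error message narrows the search to a single element instead of the whole document.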

> I'm getting this frustrating error when running the PHP page:

This is a very strange error that isn't listed. It would be useful for me to have a small template and PHP snippet that reproduce it.

> Anyway, I'm still working away at this trying to get my workaround to happen.

If you can wait a bit, I'm about to make a fix for the <wp:docPr> in OpenTBS.


By: John Lawler
Date: 2012-02-14
Time: 23:56

Re: Another Corrupt Simple Word (docx) File Problem

Thanks for the troubleshooting explanation.  I've been troubleshooting pure Word 2003 XML documents for awhile and have probably used similar approaches when doing that, but didn't realize that if you use 'tidy -xml' (e.g.) on the word\document.xml file, and stick it back in the docx, Word may be more specific about exactly where to look for the problem when the XML is broken down into many lines instead of just a few as it is by default.

Also, if you're planning to release a fix update to OpenTBS in the next couple of days, I'll just wait on it.  I do have a project coming up soon that I'd like to do with TBS, but I can wait a couple of days.

Oh, also, a few minutes ago I sent you an email containing an example of how I get the nasty "Undefined property" errors.  Thanks for your continued interest and work.
By: Skrol29
Date: 2012-02-15
Time: 01:10

Re: Another Corrupt Simple Word (docx) File Problem

Hi John,

OpenTBS 1.7.5 is released. It is supposed to magically renumber the <wp:docPr> ids when appropriate.
Thank you for your help on this point.
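
For readers curious what such a renumbering involves, here is a rough sketch of the general idea. It is not the code used in OpenTBS 1.7.5: renumber_docpr_ids is a hypothetical function, it works on a plain XML string with a regular expression, and it simply assigns a running counter to every <wp:docPr> id it finds.

```php
<?php
// Illustration only: this is NOT how OpenTBS implements its fix.
// Rewrites the id attribute of each <wp:docPr> start tag with a running counter.
function renumber_docpr_ids(string $xml, int $start = 1): string
{
    $next = $start;
    return preg_replace_callback(
        // Group 1: everything up to the opening quote of the id value.
        // Group 2: the closing quote. The old value in between is discarded.
        '/(<wp:docPr\b[^>]*?\sid=")[^"]*(")/',
        function (array $m) use (&$next) {
            return $m[1] . ($next++) . $m[2];
        },
        $xml
    );
}

// Made-up sample: the same shape merged onto two pages with the same id.
$merged = '<wp:docPr id="1" name="Text Box 1"/><wp:docPr id="1" name="Text Box 1"/>';
echo renumber_docpr_ids($merged), "\n";
```

A real fix also has to decide when renumbering is appropriate; this sketch renumbers unconditionally.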

Now I will look at the package you've sent me in order to understand this strange new error.

By: John Lawler
Date: 2012-02-15
Time: 16:39

Re: Another Corrupt Simple Word (docx) File Problem

Thanks so much for your work on this.  I can see by the extra code you added this was probably not a trivial fix for you.  In my initial sample test here, that beautifully fixes the problems I was having previously with my drawing-heavy template.

I should be able to test any patch you're able to build for the other error if you get that complete too.

Thanks again.
By: Anonymous
Date: 2012-05-13
Time: 12:59

Re: Another Corrupt Simple Word (docx) File Problem

It's quite possible your .docx file was corrupted. You may be able to fix the file with Recovery Toolbox for Word. You can download the tool below:

http://www.recoverytoolbox.com/repair_word.html
By: N. Sem
Date: 2012-09-13
Time: 13:33

Re: Another Corrupt Simple Word (docx) File Problem

There is an article on Idea Marketers; here is the link: http://www.ideamarketers.com/?how_to_repair_word_document&articleid=3557622. The article describes how a Word file gets corrupted and how one can repair it. A tool named SysInfoTools MS Word repair is mentioned in the article; it claims to fix heavy corruption in Word files.
By: Michael
Date: 2012-09-14
Time: 06:31

Re: Another Corrupt Simple Word (docx) File Problem

Thanks, N. Sem, for sharing this article here. I read the article, but it lists five utilities, each for a particular kind of MS Word file. I thought there might be one that supports recovery of all Word files.
But anyway, thanks.