

What can you do with the docx format?

Is the docx format just a change from the doc format for the sake of change? Is this a case of Microsoft chasing a new buzzword (here, XML), or does the new format have value of its own?

Wouldn’t it have been simpler if Microsoft had officially published its proprietary Office formats for everyone to use, as it did recently?

Actually, it turns out that you can do quite a lot with the docx format. Below I describe a real-world case study in which I used this format.

Modifying Word files - A case study

At Panergy we have written a doc converter and a docx converter, and we know the doc, docx and rtf formats quite well. So when a friend called to explain that he was having trouble with the combination of Microsoft Word and Déjà Vu, I had a real feeling of déjà vu myself.

The problem environment

My friend runs a translation business. Today translators use highly specialized applications such as Déjà Vu to do incremental translations, especially for technical material. This is quite logical: each time a new version of a software package or a new model of an instrument comes out, you do not want to retranslate the whole manual, only the delta, and you want it to fit nicely into the old translation.

That is the theory, and in general it works quite well. In his case, however, it was not working all that well. The reason is that language support is not created equal, neither in Déjà Vu nor in Word. English is obviously well tested, French and German too, but what about right-to-left languages such as Arabic or Hebrew? These are only moderately tested, and most of the testing is left to users. In these languages most products are left with peculiar behaviors that are never fixed, and users have to find workarounds in order to be productive.

What was the problem about?

Quite simply, my friend translates technical manuals from English to Hebrew with the help of Déjà Vu, which produces an RTF file. This RTF file was then opened in Word to review the whole document and create the translated manual. Everything looked fine in Word (see picture), but the PDF created from it was not right (see picture), and the English text could not be modified without messing up the whole paragraph. My friend therefore had to fix manually all the right-to-left and left-to-right runs in every paragraph where English and Hebrew were mixed. Since this happens all the time in technical documents, he had to spend 10 minutes per page fixing the formatting. For a thirty-page manual he was spending 5 hours just on reformatting.

In general the customers are large corporations that translate into all kinds of languages, and they had never had such problems with "normal" translations. So how do you explain the delays, and how do you charge your customer for this?

This extra formatting was costly and totally counterproductive, as it distracted the translator from the review, which suffered in the process.

My friend asked me: what can you do to help me?

How to reformat a Word file automatically?

I had to find a way to fix all the formatting with a program rather than manually. Normally the solution would have been to use our RTF parser and fix the RTF file, or to do the same with the doc file. The only problem is that this is not a small project, as there are a lot of cases in which the reformatting should happen.

How long should such a project take?

To be a programmer is to be an optimist; otherwise we would never start any project, because it might fail or take far too long. With this in mind, I estimated that it might take me three weeks to get something decent working. Actually, even a month of work on this project might have been worthwhile for my friend: he had just received a 1000-page manual, and he estimated that the reformatting of this manual alone would cost him one month. But before starting, I thought that maybe, just maybe, I could use known tools to do the job, and that I would not have to write anything myself except some scripts to drive these tools (I am an optimist, and lazy too). That was an avenue worth exploring; at the very least I could give it a day of experimenting to see whether it could be done. To complicate matters, my friend works on Windows and I develop on the Mac. With docXConverter we have gone cross-platform, but we still develop on the Mac.

Docx format to the rescue

My idea was to exploit the characteristics of the docx format together with a good text editor, and to modify the docx files directly in order to get the proper results. So I got an example file from my friend and off I went. First I converted his file from doc to docx by opening the doc file in Word 2007 on my Windows machine and saving it as a docx file. Then I moved it to the Mac.

Step 1: Opening the docx file

Dragging the docx file to StuffIt Expander gave me a folder with the name of the docx file. In this folder I had three more folders, named _rels, docProps and word, and a file called [Content_Types].xml. Since I wrote part of docXConverter, I knew what I was looking for. I opened the folder called word and saw three more folders (_rels, media and theme) and 13 xml files. The xml files are for footers, headers, endnotes, footnotes, settings and styles, and for the text of the main part of the document, which is in a file called document.xml. All that was left was to open this document.xml and mark the English runs as left to right where necessary. In general this was necessary wherever some English appeared in the middle of Hebrew runs.
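
The same look inside the package can be had without an archive utility. The following is only a sketch, assuming Python and its standard zipfile module are at hand (the case study itself used StuffIt Expander); the function names are made up for illustration, and the main part is assumed to be stored as UTF-8.

```python
import zipfile


def list_parts(docx_path):
    """Return the names of all parts stored in the docx package.

    A docx file is an ordinary zip archive, so zipfile can read it directly.
    docx_path may be a file path or an open binary file object.
    """
    with zipfile.ZipFile(docx_path) as package:
        return package.namelist()


def read_main_document(docx_path):
    """Return the text of word/document.xml (assumed to be UTF-8)."""
    with zipfile.ZipFile(docx_path) as package:
        return package.read("word/document.xml").decode("utf-8")
```

Calling list_parts on a real file should show [Content_Types].xml at the top level and the header, footer, styles and document parts under word/, matching the folder layout described above.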

Step 2: Finding the right to left runs

Now that I had my xml file, I thought that all I needed to do was open it in BBEdit and find my right-to-left runs. So I dragged the document.xml file to BBEdit, and what I got was xml that was difficult to digest (see picture on the right). BBEdit was really just showing what comes out of Word, so I couldn’t blame it, but this was not very useful and I couldn’t work with the file in this form. No great problem, I thought: I could use the ElfData XML editor that I had purchased when developing docXConverter. I opened document.xml with it (see picture on the right) and got a much nicer display with full indentation; in short, something I could work with. My first test was to find manually an English run that was defined like a Hebrew run and to change its definition. I located the following run description:

Clearly we have two problems with this run: it is defined as a right-to-left run (by the tag "w:rtl") and as being in Hebrew (by the tag "w:lang w:val="he-IL"").
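
Hunting for such runs by eye does not scale, so here is a rough sketch of how the search could be automated, assuming Python's standard ElementTree parser. The sample fragment is hypothetical (it is not the run from the example file), and the test for "English" is a crude assumption: at least one ASCII letter and no character from the Hebrew Unicode block.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, bound to the "w:" prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def latin_runs_marked_rtl(document_xml):
    """Return the text of each run flagged w:rtl that holds only Latin text."""
    root = ET.fromstring(document_xml)
    hits = []
    for run in root.iter(f"{{{W}}}r"):
        rtl = run.find("w:rPr/w:rtl", NS)
        # A missing flag, or a flag explicitly switched off, means not RTL.
        if rtl is None or rtl.get(f"{{{W}}}val") in ("0", "false", "off"):
            continue
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        has_latin = any(c.isascii() and c.isalpha() for c in text)
        has_hebrew = any("\u0590" <= c <= "\u05FF" for c in text)
        if has_latin and not has_hebrew:
            hits.append(text)
    return hits


# Hypothetical fragment: one Hebrew run, one French run tagged like Hebrew,
# and one run without any direction flag.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:rPr><w:rtl/><w:lang w:val="he-IL"/></w:rPr><w:t>שלום</w:t></w:r>'
    '<w:r><w:rPr><w:rtl/><w:lang w:val="he-IL"/></w:rPr><w:t>Bonjour</w:t></w:r>'
    '<w:r><w:t>plain</w:t></w:r>'
    '</w:p>'
)

print(latin_runs_marked_rtl(SAMPLE))  # ['Bonjour']
```

This only reports candidates; it changes nothing in the file, which makes it a safe first check before any replacement is attempted.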

Step 3: Fixing a sample run and checking the results

I fixed both tags manually: I replaced rtl by ltr and he-IL by en-US. Strictly speaking en-US was wrong, as this text is in French, but it was good enough for my testing. I saved the document.xml file, compressed the folder containing the file parts into a zip archive with DropStuff, copied the result to my PC and went to check my results.
I opened the file with Word 2007 and got an error message: the file was corrupted and not recognized, or something of that nature. What was going on here? I had been careful, I had made only simple changes, and certainly nothing I had done could have corrupted the file.
I tried to open the file with WinZip on the PC to see whether it really was corrupted. WinZip opened it, and then I realized my mistake. I had compressed the whole folder, and the folder itself was what Word 2007 could not swallow; I should have compressed only the parts, not the folder containing them. I recompressed the parts without the containing folder, renamed the file with a docx extension and tried again to open it in Word 2007. This time it opened fine, and the English (or rather French) text really appeared as a left-to-right run. I was excited: the proof of concept was done. I called my friend and told him that I might have something for him in a few days. He was excited too, and I went for a coffee break. I had a proof of concept, but I still needed a reasonable implementation that I could install at his place.
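
The packaging mistake is easy to avoid in a script. This is a sketch only, assuming Python's standard zipfile module rather than the DropStuff and WinZip tools used in the case study; pack_docx is a made-up name. The point is the archive name given to each file: it is made relative to the unpacked folder, so [Content_Types].xml ends up at the root of the archive instead of one level down.

```python
import os
import zipfile


def pack_docx(parts_folder, docx_path):
    """Zip the contents of parts_folder, not the folder itself, into docx_path.

    docx_path should lie outside parts_folder so the archive does not
    try to include itself.
    """
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as package:
        for dirpath, _dirnames, filenames in os.walk(parts_folder):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                # Name inside the archive, relative to the unpacked folder:
                # "word/document.xml", never "manual/word/document.xml".
                arcname = os.path.relpath(full_path, parts_folder)
                package.write(full_path, arcname)
```

On a Mac, stray files such as .DS_Store inside the folder would be packed too; whether Word tolerates them is not something this sketch checks.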

Step 4: Fixing all Right to Left runs with English text

Now I needed to change all the English runs defined as rtl to ltr, and to change their language as well. I figured that the best way to do this would be to use grep in BBEdit. So I opened my document.xml in the ElfData XML editor, saved it, opened it in BBEdit, worked out the proper grep pattern, did a Replace All, saved, compressed, moved the file to the PC and checked it. The whole thing worked: I was able to fix all the English runs within the Hebrew runs. The only problem was that this was good enough for me, but not something I could install at my friend’s place. Today most of my friend’s work is done on Windows machines, but he still keeps two Macs in working order, one with Mac OS X 10.3 and the other with Mac OS 9. So I had no problem installing my solution on one of the Macs, but I certainly couldn’t teach his staff grep.

Step 5: Streamlining the process

The first step was to eliminate the need to open document.xml in the ElfData XML editor. I simply reformatted the file within BBEdit, using another grep pattern that put a carriage return between the lines of the xml file. This did not give formatting as nice as ElfData’s, with each level indented, but it was adequate for me. The formatting did not matter for the final xml file, because Word ignores all the formatting between the xml tags, so I had eliminated an extra step.

Now I had two grep patterns and two Replace All operations: time to automate. I knew that BBEdit supports scripting, so I recorded these two Replace All operations and got two scripts that I could execute from the Macros menu of BBEdit. Now I was much more in business. I worked some more on the example file, and it turned out that there were six more patterns that I could replace to cut down formatting time. So I worked on my patterns for a while, recorded some more macros, and in the end I had seven working macros. I renamed them 01_format, 02_RTL, etc., and I was almost ready to go to my friend and install all this.

I now had a much better process, but I still had the unzip and zip steps on the Mac, and these were a little error-prone. Since I generally get a phone call when something goes wrong, I looked for an easier way. It turned out that there is something much easier and much safer. I used WinZip Pro to open the docx file and show it as a directory tree. Then I copied document.xml directly from the tree in WinZip Pro to a USB disk-on-key. I plugged the disk-on-key into the Mac and dragged the document.xml file from it to the BBEdit icon in the Dock. I executed my macros, which saved the file at the end, then went back to the PC and dragged the document.xml file back into the WinZip directory, and that was it. This was a process in which I was confident, and I figured that it would save my friend a lot of time, which was the whole idea.
I went to his place, purchased all the needed software, installed it all, and needed 15 more minutes to explain the whole process.
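
Swapping a single part inside the package, as done here by dragging document.xml in and out of WinZip Pro, can also be scripted. The sketch below assumes Python's standard zipfile module; replace_part is a hypothetical helper. Since zipfile cannot overwrite one entry in place, it copies every part to a new archive and substitutes the edited one on the way.

```python
import zipfile


def replace_part(src_docx, dst_docx, part_name, new_text):
    """Copy src_docx to dst_docx, replacing one part with new_text.

    src_docx and dst_docx must be different files (or file objects).
    new_text is written as UTF-8, which is an assumption about the part.
    """
    with zipfile.ZipFile(src_docx) as src:
        names = src.namelist()
        if part_name not in names:
            raise KeyError(part_name)
        with zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
            for name in names:
                if name == part_name:
                    data = new_text.encode("utf-8")
                else:
                    data = src.read(name)  # every other part is copied as is
                dst.writestr(name, data)
```

A call such as replace_part("manual.docx", "manual_fixed.docx", "word/document.xml", fixed_xml) leaves the original untouched, which keeps a fallback copy if the edited file does not open in Word (the file names here are placeholders).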

Step 6: Streamlining more -- AppleScript versus text factories

All went well, I was very happy, and I was beginning to think that I had done some good work. Then I got two phone calls. The first was from my friend, who was very polite and told me that the program was indeed saving time, but that the tables were still ltr instead of rtl, and that maybe I could fix a few more things that he pointed out. He explained that the program was saving 60% of the wasted time but that we should aim higher. The second call was from the person who actually ran the program. He told me that everyone on the staff was laughing at him because he executed the macros one after the other while counting out loud, one, two, etc., up to seven. He asked whether I could put them all together so that his colleagues would stop laughing.

I said OK to both and went to work. It turned out that the macros were in AppleScript, so combining them might not be a good idea: AppleScript was executing asynchronously, and I couldn’t find a good way to start one macro and wait until it had finished before launching the next. Surely there had to be a better way. I went back to BBEdit and started reading the manual to dig up something better. It turns out that BBEdit has a wonderful feature called a text factory: a succession of operations within BBEdit, in my case a succession of Find/Replace operations. This was clearly an improvement, so I implemented it, added three more replacement patterns, and now had my ultimate text factory, which executed with one command.
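
The idea of a text factory, an ordered list of Find/Replace operations run with one command, carries over to any language with regular expressions. What follows is a sketch in Python, not the BBEdit factory itself: the article does not list its grep patterns, so both rules are stand-ins written for illustration. The first imitates the reformatting step by putting a line break between adjacent tags; the second rewrites the language tag in runs whose text has Latin letters and no Hebrew. How the direction flag itself should be rewritten is left out, since the exact markup is not shown.

```python
import re

HEBREW = re.compile(r"[\u0590-\u05FF]")
LATIN = re.compile(r"[A-Za-z]")
RUN_TEXT = re.compile(r"<w:t\b[^>]*>([^<]*)</w:t>")


def retag_latin_run(match):
    """Switch he-IL to en-US inside one run if its text is Latin only."""
    run = match.group(0)
    text = "".join(RUN_TEXT.findall(run))
    if LATIN.search(text) and not HEBREW.search(text):
        run = run.replace('w:val="he-IL"', 'w:val="en-US"')
    return run


# Ordered rules: (name, compiled pattern, replacement string or function).
RULES = [
    # Break between adjacent tags, but never inside an empty <w:t></w:t>,
    # where the line break would become part of the document text.
    ("01_format", re.compile(r"><(?!/w:t>)"), ">\n<"),
    # Assumes simple runs that do not nest other runs.
    ("02_lang", re.compile(r"<w:r\b[^>]*>.*?</w:r>", re.DOTALL), retag_latin_run),
]


def run_factory(xml_text, rules=RULES):
    """Apply every rule in order, each one to the output of the previous."""
    for _name, pattern, replacement in rules:
        xml_text = pattern.sub(replacement, xml_text)
    return xml_text
```

Because the rules are plain data, adding a pattern means appending one tuple to the list, which is the same property that made the factory file easy to extend.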

Step 7: Wrapping up

With the text factory feature of BBEdit I had a very nice and extensible mechanism for improving my text processing of docx files. I could add more patterns whenever I received example files that could be improved, and send the factory file by email so that they just had to drop the new factory onto their Mac. Overall, I think that this small program saves my friend 90%-95% of the extra formatting, and he is very happy with the results. For my part, I can say that the new docx format was very useful: I was able to achieve good results with standard tools in a fraction of the time it would have taken me to write a full-fledged program. I am sure there is a text editor on Windows that can do the same as BBEdit, which would streamline the process even more, and maybe I will look into this if my friend’s Mac breaks down. For the moment, everyone is happy.

docx + WinZip + BBEdit = Solution in 3 days