Ben Ramey's Blog
Scripture, programming problems, solutions and stories.

Formatting multiple XML files at a time

I recently had a situation where I needed to compare many XML files generated by a program of one version to the same set of XML files produced by a previous version of the same program. Unfortunately, the sets of XML files were formatted differently and so doing a file comparison with Beyond Compare (a GREAT file comparison tool, by the way) was going to be useless.

So, I started looking for a way to quickly format all the files in each set the same way with one program. I looked into using Notepad++, which has a great XML Tools plugin (look for it under Plugins > Plugin Manager). I tried combining the plugin’s formatting commands with a macro that would format the XML file, save it and close it. That way, I could open the few hundred files I had to format in Notepad++ (one set at a time), then run the macro multiple times (Macro > Run a Macro Multiple Times…) so that it worked through each file until all were formatted and closed. However, I couldn’t get the Notepad++ macro system to actually perform the XML Tools plugin format command. The macro would run successfully, saving and closing each file, but when I checked the files they had not been formatted. I worked with it for a while, but could not figure out what the matter was.

I knew the real solution had to be some type of command-line utility and a batch file. So, I started looking into that. The solution I ended up with was just that.

First of all, I found HTML Tidy, which I could run from the Windows command line to format a file. Using a configuration file for tidy.exe (placed in the same directory as tidy.exe and named tidycfg.ini, although neither the location nor the name matters, see below) that looked like this:

indent:yes
indent-attributes:yes

I got the formatting I wanted.
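
To see why applying one set of indentation rules to both sets makes them comparable, here is a minimal sketch in Python. It does not use HTML Tidy at all: it swaps in the standard library's xml.etree.ElementTree (ET.indent needs Python 3.9 or later), and the normalize function and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def normalize(xml_text: str) -> str:
    """Re-indent an XML string with one fixed rule (two spaces per level)."""
    root = ET.fromstring(xml_text)
    # ET.indent rewrites the whitespace between elements in place;
    # text inside leaf elements is left untouched.
    ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data: the same content laid out two different ways.
old_layout = "<config><item id='1'>a</item><item id='2'>b</item></config>"
new_layout = """<config>
        <item id="1">a</item>
    <item id="2">b</item>
</config>"""

print(old_layout == new_layout)                        # raw text differs
print(normalize(old_layout) == normalize(new_layout))  # same after re-indenting
```

The raw strings differ only in layout, so a plain text comparison flags them as different, while the re-indented versions come out identical. That is the effect the Tidy configuration above is meant to achieve on real files.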

Now, all I had to do was brush up on my Windows batch command skills to run tidy.exe on multiple files. Easy enough! This is what my batch file looked like:

for /d %%X in (C:\<path_to_parent_directory>\*) do (c:\<path_to_tidy.exe>\tidy.exe -m -xml -config c:\<path_to_tidy.ini_file>\tidycfg.ini %%X\<xml_file_name>.xml)

I had a folder structure where there were hundreds of directories inside this one parent directory. Each of the child directories had a single XML file in it. Therefore, I needed the C:\<path_to_parent_directory>\* wildcard.

So, what this batch file does is simply loop over each child directory in my parent directory (the /d switch makes for match directories instead of files). For each directory it runs (do) the tidy.exe program, tells it to modify the input file in place (-m) instead of saving the formatted XML to another file, tells it that the input file is well-formed XML (-xml) and then tells it where the tidycfg.ini file is (-config). Finally, it tells tidy.exe to take the current directory (%%X) and use the <xml_file_name>.xml file inside it as the input file to format. Note that the doubled %% is only needed inside a batch file; typed directly at the command prompt, the variable would be %X.
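
For anyone who would rather not write a batch file, the same loop can be sketched in Python. This is an illustration of the idea, not the batch file described above: the function names build_tidy_commands and run_all are made up, and the paths in the usage comment are hypothetical placeholders. Only the tidy.exe switches (-m, -xml, -config) come from the command above.

```python
import subprocess
from pathlib import Path


def build_tidy_commands(parent_dir, tidy_exe, config_file, xml_name):
    """Return one tidy command (as an argument list) per child directory."""
    commands = []
    for child in sorted(Path(parent_dir).iterdir()):
        if not child.is_dir():
            continue  # like `for /d`, only directories are visited
        commands.append([
            str(tidy_exe),
            "-m",                         # modify the input file in place
            "-xml",                       # treat the input as well-formed XML
            "-config", str(config_file),  # formatting options to apply
            str(child / xml_name),        # the XML file inside this directory
        ])
    return commands


def run_all(commands):
    """Run each command in turn; a failure on one file does not stop the rest."""
    for cmd in commands:
        subprocess.run(cmd, check=False)


# Hypothetical usage (all four values are placeholders):
# run_all(build_tidy_commands(r"C:\xml_sets\old", r"C:\tools\tidy.exe",
#                             r"C:\tools\tidycfg.ini", "output.xml"))
```

Building the commands separately from running them makes it easy to print the list first and check it before any file is modified, which matters because -m overwrites the originals.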

This little setup worked very well and quickly formatted all of my files in the same way, so that I could successfully compare them with Beyond Compare.
