REXML is an XML processor for the language Ruby. REXML is conformant (passes 100% of the Oasis non-validating tests), and includes full XPath support. It is reasonably fast, and is implemented in pure Ruby. Best of all, it has a clean, intuitive API.
This software is distributed under the Ruby license.
Why REXML? There are, at the time of this writing, already two XML parsers for Ruby. The first is a Ruby binding to a native XML parser. This is a fast parser, using proven technology. However, it isn't very portable. The second is a native Ruby implementation, and as useful as it is, it has (IMO) a difficult API.
I have this problem: I dislike obfuscated APIs. There are several XML parser APIs for Java. Most of them follow DOM or SAX, and are very similar in philosophy with an increasing number of Java APIs. Namely, they look like they were designed by theorists who never had to use their own APIs. The extant XML APIs, in general, suck. They take a markup language which was specifically designed to be very simple, elegant, and powerful, and wrap an obnoxious, bloated, and large API around it. I was always having to refer to the API documentation to do even the most basic XML tree manipulations; nothing was intuitive, and almost every operation was complex.
Then along came Electric XML.
Ah, bliss. Look at the Electric XML API. First, the library is small; less than 500K. Next, the API is intuitive. You want to parse a document? doc = new Document( some_file ). Create and add a new element? element = parent.addElement( tag_name ). Write out a subtree? element.write( writer ). Now how about DOM? To parse some file: parser = new DOMParser(); parser.parse( new InputSource( new FileInputStream( some_file ) ) ). Create a new element? First you have to know the owning document of the to-be-created node (can anyone say "global variables, or obtuse, multi-argument methods"?) and call element = doc.createElement( tag_name ); parent.appendChild( element ). "appendChild"? Where did they get that from? How many different methods do we have in Java in how many different classes for adding children to parents? addElement()? add()? put()? appendChild()? Heaven forbid that you want to create an Element elsewhere in the code without having access to the owning document. I'm not even going to go into what travesty of code you have to go through to write out an XML sub-tree in DOM.
So, I use Electric XML extensively. It is small, fast, and intuitive; i.e., the API doesn't add a bunch of work to the task of writing software. When I started to write more software in Ruby, I needed an XML parser. I wasn't keen on the native library binding, "XMLParser", because I try to avoid complex library dependencies in my software when I can. For a long time, I used NQXML, because it was the only other parser out there. However, the NQXML API can be even more painful than the Java DOM API. Almost all element operations require some indirect node access... you had to do something like element.node.attr['key'], and it is never obvious to me when you access the element directly, or the node... or, really, why they're two different objects, anyway. This is even more unfortunate since Ruby is so elegant and intuitive, and bad APIs really stand out. I'm not, by the way, trying to insult NQXML; I just don't like the API.
I wrote the people at TheMind (Electric XML... get it?) and asked them if I could do a translation to Ruby. They said yes. After a few weeks of hacking on it for a couple of hours each week, and after having gone down a few blind alleys in the translation, I had a working beta. I.e., it parsed, but hadn't gone through a lot of strenuous testing. Along the way, I had made a few changes to the API, and a lot of changes to the code. First off, Ruby does iterators differently than Java. Java uses a lot of helper classes. Helper classes are exactly the kinds of things that theorists come up with... they look good on paper, but using them is like chewing glass. You find that you spend 50% of your time writing helper classes just to support the other 50% of the code that actually does the job you were trying to do in the first place. In this case, the Java helper classes are either Enumerations or Iterators. Ruby, on the other hand, uses blocks, which is much more elegant. Rather than:
    for (Enumeration e=parent.getChildren(); e.hasMoreElements(); ) {
      Element child = (Element)e.nextElement();
      // Do something with child
    }
you get:
    parent.each_child { |child|
      # Do something with child
    }
Can't you feel the peace and contentment in this block of code? Ruby is the language Buddha would have programmed in.
Anyhoo, I chose to use blocks in REXML directly, since this is more common to Ruby code than for x in y ... end, which is as orthogonal to the original Java as possible.
Also, I changed the naming conventions to more Ruby-esque method names. For example, the Java method getAttributeValue() becomes in Ruby get_attribute_value(). This is a toss-up. I actually like the Java naming convention more [1], but the latter is more common in Ruby code, and I'm trying to make things easy for Ruby programmers, not Java programmers.
The biggest change was in the code. The Java version of Electric XML did a lot of efficient String-array parsing, character by character. Ruby, however, has ubiquitous, efficient, and powerful regular expression support. All regexp functions are done in native code, so they are very fast, and the power of Ruby regexps rivals that of Perl. Therefore, a direct conversion of the Java code to Ruby would have been more difficult, and much slower, than using Ruby regexps. I therefore used regexps. In doing so, I cut the number of lines of source code by half.
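As a small sketch of block-style iteration with the REXML shipped with current Ruby versions (the sample document and variable names here are made up for illustration): each_child visits every child node, while each_element visits only child elements.

```ruby
require 'rexml/document'

# Illustrative document: two elements followed by a text node.
doc = REXML::Document.new('<inventory><item id="1"/><item id="2"/>note</inventory>')
parent = doc.root

# each_child yields every child node, elements and text alike.
child_classes = []
parent.each_child { |child| child_classes << child.class }

# each_element yields only the child elements.
ids = []
parent.each_element { |item| ids << item.attributes['id'] }

puts child_classes.inspect
puts ids.inspect
```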
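To give a feel for the regexp-driven style, here is a rough sketch that pulls a tag name and its attributes out of a start tag. The patterns START_TAG and ATTRIBUTE and the helper parse_start_tag are hypothetical, written for this illustration only; they are not taken from REXML's source and do not handle all of XML.

```ruby
# Hypothetical patterns for illustration; not REXML's real ones.
START_TAG = /\A<([A-Za-z_:][\w:.\-]*)((?:\s+[\w:.\-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(\/?)>/
ATTRIBUTE = /([\w:.\-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Returns a hash describing the start tag at the head of +source+,
# or nil if +source+ does not begin with one.
def parse_start_tag(source)
  match = START_TAG.match(source)
  return nil unless match

  attributes = {}
  # Each scan result is [name, double_quoted_value, single_quoted_value].
  match[2].scan(ATTRIBUTE) { |name, dq, sq| attributes[name] = dq || sq }
  { name: match[1], attributes: attributes, empty: match[3] == '/' }
end

tag = parse_start_tag(%q{<book id="42" lang='en'/>})
puts tag.inspect
```

One regexp match replaces what would otherwise be a hand-written character loop with its own state tracking, which is the kind of saving described above.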
Finally, by this point the API looks almost nothing like the original Electric XML API, and practically none of the code is even vaguely similar. However, even though the actual code is completely different, I did borrow Electric's approach to processing XML, and am deeply indebted to the Electric XML code for inspiration.
One last thing. If you use and like this software, and you feel compelled to make some contribution to the author by way of saying "thanks", and you happen to know what a tea cozy is and where to get them, then you can send me one. Send those puppies to: Sean Russell 606 S. Gulph Ct. #321 King of Prussia, PA 19406 USA If you're outside of the US, make sure you write "gift" on it to avoid the taxes. If you don't want to send a tea cozy, you can also send money. Or don't send anything. Offer me a job I can't refuse, in Western Europe somewhere.
Run ruby bin/install.rb. By the way, you really should look at these sorts of files before you run them as root. They could contain anything, and since (in Ruby, at least) they tend to be mercifully short, it doesn't hurt to glance over them. If you want to uninstall REXML, run ruby bin/install.rb -u.
If you have Test::Unit installed, you can run the unit test cases. You can run both installed and not installed tests; to run the tests before installing REXML, run ruby -I. bin/suite.rb. To run them with an installed REXML, use ruby bin/suite.rb.
There is a benchmark suite in benchmarks/. To run the benchmarks, change into that directory and run ruby comparison.rb. If you have nothing else installed, only the benchmarks for REXML will be run. However, if you have any of the following installed, benchmarks for those tools will also be run:
The results will be written to index.html.
Please see the Tutorial.
The API documentation is available on-line, or it can be downloaded as an archive in tgz format (~70Kb) or (if you're a masochist) in zip format (~280Kb). The best solution is to download and install Dave Thomas' most excellent rdoc and generate the API docs yourself; then you'll be sure to have the latest API docs and won't have to keep downloading the doc archive.
The unit tests in test/ and the benchmarking code in benchmarks/ provide additional examples of using REXML. The Tutorial provides examples with commentary. The documentation unpacks into rexml/doc.
Kouhei Sutou maintains a Japanese version of the REXML API docs. Kou's documentation page contains links to binary archives for various versions of the documentation.
Unfortunately, NQXML is the only package REXML can be compared against; XMLParser uses expat, which is a native library, and really is a different beast altogether. So in comparing NQXML and REXML you can look at four things: speed, size, completeness, and API.
REXML is faster than NQXML in some things, and slower than NQXML in a couple of things. You can see this for yourself by running the supplied benchmarks. Most of the places where REXML is slower are because of the convenience methods [7]. On the positive side, most of the convenience methods can be bypassed if you know what you are doing. Check the benchmark comparison page for a general comparison. You can look at the benchmark code yourself to decide how much salt to take with them.
The sizes of the XML parsers are close [8]. NQXML 1.1.3 has 1580 non-blank, non-comment lines of code; REXML 2.0 has 2340 [9].
REXML is a conformant XML 1.0 parser. It supports multiple language encodings, and internal processing uses the required UTF-8 and UTF-16 encodings. It passes 100% of the Oasis non-validating tests. Furthermore, it provides a full implementation of XPath, a SAX2 and a PullParser API.
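As a minimal sketch of the pull parser API, assuming the REXML::Parsers::PullParser class as shipped with current Ruby versions (the input document is invented for the example), the caller asks for one event at a time instead of building a tree:

```ruby
require 'rexml/parsers/pullparser'

# Illustrative input; any String or IO source should work the same way.
parser = REXML::Parsers::PullParser.new('<list><item>a</item><item>b</item></list>')

element_names = []
texts = []
while parser.has_next?
  event = parser.pull
  if event.start_element?
    element_names << event[0]   # first argument of a start_element event is the tag name
  elsif event.text?
    texts << event[0]           # first argument of a text event is the raw text
  end
end

puts element_names.inspect
puts texts.inspect
```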
The last thing is the API, and this is where I think REXML wins. The core API is clean and intuitive, and things work the way you would expect them to. Convenience methods abound, and you can code for either convenience or speed. REXML code is terse, and readable, like Ruby code should be. The best way to decide which you like more is to write a couple of small applications in each, then use the one you're more comfortable with.
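A short sketch of the core tree API as found in current REXML releases; the document content and variable names are made up for the example. It parses a document, reads an attribute, adds an element, and writes the tree back out.

```ruby
require 'rexml/document'

# Parse a document from a string (a File object works as well).
doc = REXML::Document.new('<library><book title="Dune"/></library>')

# Look up a child element with an XPath expression and read an attribute.
book = doc.root.elements['book']
title = book.attributes['title']

# Create and add a new element, with attributes, in one call.
doc.root.add_element('book', { 'title' => 'Emma' })

# Write the whole tree into a string buffer.
output = String.new
doc.write(output)

puts title
puts output
```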
As of release 2.0, XPath 1.0 is fully implemented.
I fully expect bugs to crop up from time to time, so if you see any bogus XPath results, please let me know. That said, since I'm now following the XPath grammar and spec fairly closely, I suspect that you won't be surprised by REXML's XPath very often, and it should become rock solid fairly quickly.
Check the "bugs" section for known problems; there are little bits of XPath here and there that are not yet implemented, but I'll get to them soon.
Namespace support is rather odd, but it isn't my fault. I can only do so much and still conform to the specs. In particular, XPath attempts to help as much as possible. Therefore, in the trivial cases, you can pass namespace prefixes to Element.elements[...] and so on -- in these cases, XPath will use the namespace environment of the base element you're starting your XPath search from. However, if you want to do something more complex, like pass in your own namespace environment, you have to use the XPath first(), each(), and match() methods. Also, default namespaces force you to use the XPath methods, rather than the convenience methods, because there is no way for XPath to know what the mappings for the default namespaces should be. This is exactly why I loathe namespaces -- a pox on the person(s) who thought them up!
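A sketch of both cases, assuming the REXML shipped with current Ruby versions; the namespace URIs and the prefix "d" are invented for the example. A prefix declared in the document can go straight into elements[...], while a default namespace is reached by handing XPath.first() or XPath.each() your own prefix mapping.

```ruby
require 'rexml/document'

xml = <<~XML
  <catalog xmlns="http://example.org/default" xmlns:m="http://example.org/meta">
    <entry><m:tag>ruby</m:tag></entry>
  </catalog>
XML
doc = REXML::Document.new(xml)

# Trivial case: the prefix "m" is declared in the document itself.
tag = doc.root.elements['//m:tag']

# Default namespace: supply your own mapping. The prefix "d" is arbitrary;
# only the URI has to match the one used in the document.
mapping = { 'd' => 'http://example.org/default', 'm' => 'http://example.org/meta' }
entry = REXML::XPath.first(doc, '//d:entry', mapping)

tag_texts = []
REXML::XPath.each(doc, '//d:entry/m:tag', mapping) { |el| tag_texts << el.text }

puts tag.text
puts entry.name
puts tag_texts.inspect
```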
Namespace support is now fairly stable. One thing to be aware of is that REXML is not (yet) a validating parser. This means that some invalid namespace declarations are not caught.
There is a low-volume mailing list dedicated to REXML. To subscribe, send an empty email to ser-rexml-subscribe@germane-software.com. This list is more or less spam proof. To unsubscribe, similarly send a message to ser-rexml-unsubscribe@germane-software.com.
An RSS file for REXML is now being generated from the change log. This allows you to be alerted of upgrades via 'pull' as they become available, if you have an RSS browser. This is an abuse of the RSS mechanism, which was intended to be a distribution system for headlines linked back to full articles, but it works. The headline for REXML is the version number, and the description is the change log. The links all link back to the REXML home page. The URL for the RSS itself is http://www.germane-software.com/software/rexml/rss.xml.
For those who are interested, there's a SLOCCount (by David A. Wheeler) file with stats on the REXML sourcecode. Note that the SLOCCount output includes the files in the test/, benchmarks/, and bin/ directories, as well as the main sourcecode for REXML itself.
You can submit bug reports and feature requests, and view the list of known bugs, at the REXML bug report page. Please do submit bug reports. If you really want your bug fixed fast, include an runit or Test::Unit method (or methods) that illustrates the problem. At the very least, send me some XML that REXML doesn't process properly.
You don't have to send an entire test suite -- just the unit test methods. If you don't send me a unit test, I'll have to write one myself, which will mean that your bug will take longer to fix.
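When submitting bug reports, please include the version of Ruby and of REXML that you're using, and the operating system you're running on. Just run: ruby -vrrexml/rexml -e 'p REXML::Version,RUBY_PLATFORM' and paste the results in your bug report. (Very old Ruby releases spelled the platform constant PLATFORM; it has since been removed in favor of RUBY_PLATFORM.)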