Friday, July 27, 2012

I just test, no qualifiers.

About two weeks ago I got into a debate with a teammate about unit testing Hadoop Pig scripts. My colleague's view was that only the UDFs used by the scripts should be unit tested; no one should use PigUnit to unit test a script itself. He thought the whole idea was silly, and he drew an equivalence to attempting to unit test a SQL call. Would anyone ever be silly enough to actually test a select statement? My answer is "I do!"

The debate was over a mix of semantics and philosophy. He was getting caught up in the word "unit" when I said I wanted to unit test. To him, a Pig script, or apparently a SQL statement, is a non-unit-testable component because it requires integration with some supporting system. One cannot easily mock Pig (you can mock a database, but there is some debate as to the necessity of such actions). In his mind, the direct dependency on Pig made the test an integration test and, as such, it should not require regression tests or even correctness tests.

I wanted a repeatable set of tests that proved that each major component of the script was doing its job. Our scripts are not the simple examples tossed about in various blogs. They often require 10+ macro definitions, each with 5+ lines of Pig Latin, often involving UDFs. To not test such code is negligence. Our entire system requires these scripts to rip through gigabytes of data looking for complex patterns that indicate fraud. We have to know, before deploying the code to even the developer's cluster, that the scripts work as expected over a variety of use cases.
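For concreteness, here is the shape of the PigUnit test I'm defending. This is a hedged sketch, not code from our project: the script path, alias names, and sample rows are all invented, and running it requires Pig and PigUnit on the classpath.

```java
// Sketch: testing a whole Pig script (not just its UDFs) with PigUnit.
// "fraud_report.pig", the aliases, and the data below are illustrative only.
import org.apache.pig.pigunit.PigTest;

public class FraudScriptTest {

    public void testSuspiciousTotals() throws Exception {
        PigTest test = new PigTest("src/main/pig/fraud_report.pig");

        // Feed the input alias small, hand-built rows instead of real HDFS data,
        // then assert on the alias the script's macros ultimately produce.
        String[] input = {
            "acct-1\t9500.0",
            "acct-1\t9800.0",
            "acct-2\t12.50"
        };
        String[] expected = { "(acct-1,19300.0)" };

        test.assertOutput("raw_events", input, "suspicious_totals", expected);
    }
}
```

The point is that every macro between `raw_events` and `suspicious_totals` gets exercised together, repeatably, before the script ever sees a cluster.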

As a result of this discussion, I've come to the conclusion that no one should "unit test". They should instead just test. Qualifying the test type just opens the conversation to religious debate. The goal should be test coverage of at least 80% of the code base, achieved in a way that isolates the code you really care about when you test.

Looking at the testing problem this way might substantially change how one develops day to day. For example, I don't only black-box test my code. I have a set of tests that check conformance to the contracts, but I also test the underlying code to make sure its implementation-specific features are doing what they should. If I'm testing a DAO, I make the test transactional so I can insert or delete data to check that my DAO's data mappings are working. Is this a unit test? No, probably not. Should it be part of the normal build? Absolutely! If you don't have such tests as mainline tests, you could be shipping a product that will fail once deployed.
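The transactional-DAO pattern looks roughly like this. Everything here is a sketch under assumptions: I'm using an in-memory H2 database (`jdbc:h2:mem:`) as a stand-in, and the `UserDao` is a minimal invented example, not our real code.

```java
// Sketch: a "transactional test" for a DAO. Real SQL runs against a real
// (in-memory) database, and the rollback leaves it clean and repeatable.
// H2 and the UserDao below are assumptions for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class UserDaoTest {

    // Minimal hypothetical DAO under test.
    static class UserDao {
        private final Connection conn;
        UserDao(Connection conn) { this.conn = conn; }

        void insert(int id, String name) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO users(id, name) VALUES (?, ?)")) {
                ps.setInt(1, id);
                ps.setString(2, name);
                ps.executeUpdate();
            }
        }

        String findNameById(int id) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT name FROM users WHERE id = ?")) {
                ps.setInt(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }

    public void testInsertAndFindMapping() throws SQLException {
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:daotest");
        conn.setAutoCommit(false);  // everything below is one transaction
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE users(id INT PRIMARY KEY, name VARCHAR(64))");

            UserDao dao = new UserDao(conn);
            dao.insert(1, "alice");                 // exercises the real INSERT mapping
            String name = dao.findNameById(1);      // exercises the real SELECT mapping
            if (!"alice".equals(name)) {
                throw new AssertionError("mapping broken: " + name);
            }
        } finally {
            conn.rollback();  // undo the insert; the next run starts from scratch
            conn.close();
        }
    }
}
```

Not a unit test by the strict definition, but exactly the kind of mainline test that catches a broken mapping before it ships.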

Approaching the problem from this perspective will improve your code quality. It doesn't add much time to the development cycle. It might even save time if you are in an IBM shop where WAS takes 2-3 minutes to boot, let alone the added time to navigate the system and get to your change. This approach works well with both TDD and BDD concepts.

There are two points when you should start wondering if you are testing properly. The first is when your builds start taking more than a few minutes to complete as a result of the testing phase. This might mean you have too many small tests that could be consolidated. It might mean you've got a test that is acting as an integration test, where it should be refactored to use mocks and isolate the components. The second is when you have a component that is highly coupled to external modules. If you cannot use DI to slip in mocks for other parts, you probably aren't testing well, and you probably aren't designing well. Inversion of control will help break down your problem into smaller, testable, bite-size parts.
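The DI point fits in a few lines of plain Java. All the names here (`RateSource`, `FraudScorer`, the 10,000 USD threshold) are invented for illustration; the shape is what matters: because the dependency arrives through the constructor, a test can slip in a fake without touching any real service.

```java
// Sketch: constructor injection makes the component testable in isolation.
public class DiExample {

    // The external module we would normally have to integrate with.
    public interface RateSource {
        double usdRate(String currency);
    }

    // The component under test depends only on the interface.
    public static class FraudScorer {
        private final RateSource rates;

        public FraudScorer(RateSource rates) {
            this.rates = rates;
        }

        // Flag any transaction worth more than 10,000 USD (invented rule).
        public boolean isSuspicious(double amount, String currency) {
            return amount * rates.usdRate(currency) > 10_000;
        }
    }

    public static void main(String[] args) {
        // The "mock": a lambda standing in for the real rate service.
        RateSource fake = currency -> "EUR".equals(currency) ? 1.25 : 1.0;
        FraudScorer scorer = new FraudScorer(fake);

        System.out.println(scorer.isSuspicious(9000, "EUR")); // 9000 * 1.25 = 11250 -> true
        System.out.println(scorer.isSuspicious(9000, "USD")); // 9000 * 1.00 =  9000 -> false
    }
}
```

If `FraudScorer` instead reached out to a singleton or instantiated its collaborator internally, this isolation would be impossible. That's the design smell the coupling warning is about.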

Avoiding labels is not always possible. But even when you can't, please remember that the label is just a concept to help you frame the problem. When you start fighting over the label's font size, so to speak, and not solving the problem, you've got to find your way back to the path and carry on.

Tuesday, May 22, 2012

Why I'm looking at Erlang over Scala

For the past few months I've been trying to learn a bit about Scala. It is a complex language, full of nuance. How a Map can function like a Perl associative array is a very interesting feat of reflection, or compile-time interface addition. But behind all of its complexity, something dark has been nagging at me. I wasn't able to place my concern until I used RabbitMQ from Java. But now I think I'll move on past the language to something else.
The pebble stuck in my craw is the dual nature of Scala: object-oriented and functional. Like most dualisms, I have a hard time understanding it. But beyond that, I think the dualism, plus Scala's desire to leverage existing Java libraries, is inherently problematic from a daily development perspective.
The dualism is troublesome because it gives a safe veneer of functional to a seedy underbelly of mutable objects. Sure, this function is passable, says Scala, but inside there be side effects. This is even more troubling when one couples older, non-thread-safe Java code to a functional-looking Scala snippet. Now we've lost our scalability. We have to worry about state because our old Java 1.4 library is unsafe. But since we're thinking in Scala, this old way of life is possibly, probably, lost on us. Now we get to deal with random threading issues like in days of yore.
This is the threat to a Scala project: old busted Java getting coupled somewhere along the line with our nice, new pseudo-functional system. Our vals vs. vars are undone by one call to some open source project that didn't worry enough about threading to even warn the user that an object is not thread safe. All it takes is one rogue developer to totally undo the whole thing (and we probably all know that one guy: the guy who adds a random JAR to the project that isn't really compatible with the version of some other JAR already in the project, which you don't find out until runtime, 3 days after going to prod).
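The hazard is easy to reproduce in plain Java, which is exactly where it would sneak in. This is an invented example, but it's the classic shape: a method that looks referentially transparent from a Scala call site, yet quietly mutates its argument.

```java
// Sketch: a "functional-looking" library call that mutates its input.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LeakyLibrary {

    // Looks pure: takes a list, returns a list. But like a lot of old
    // library code, it sorts the caller's list in place.
    public static List<Integer> topThree(List<Integer> scores) {
        Collections.sort(scores, Collections.reverseOrder()); // side effect!
        return scores.subList(0, Math.min(3, scores.size()));
    }

    public static void main(String[] args) {
        List<Integer> scores = new ArrayList<>(Arrays.asList(10, 50, 20, 40));
        List<Integer> top = LeakyLibrary.topThree(scores);

        System.out.println(top);    // [50, 40, 20]
        System.out.println(scores); // [50, 40, 20, 10] -- the input was silently reordered
    }
}
```

Bind the input to a `val` in Scala and the reference is immutable, but the list behind it just changed under you; share it across threads and you've reimported 2003's bugs.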
As a result of my choking on this problem, I've looked around for alternatives. Clojure looked good, except that, even as a LISP, it offers the same Java interop that Scala provides, and suffers for it. Side effects are possible even when I don't intend them.
Fortunately, I've been working with RabbitMQ for the past month. As part of my trials I created random messages sent to dead-letter drops to test performance. With about 24 MB of memory I can send 5,000 messages a second (about 1 KB per message) to the queue. When I attach a listener to the other end, I can sustain that throughput on the meager 256 MB Oracle Linux VM I've got hosting RabbitMQ. This wonderful little broker is written in Erlang.
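The producer side of that trial is a few lines with the RabbitMQ Java client. This is a reconstruction, not the exact trial code: the host, queue name, and counts are stand-ins, and it needs the `amqp-client` JAR plus a running broker.

```java
// Sketch: a throwaway producer flooding a queue with ~1 KB messages.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class FloodProducer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // the 256 MB VM, in my case

        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();
        try {
            // Plain, non-durable queue acting as the dead-letter drop.
            channel.queueDeclare("dead-letter-drop", false, false, false, null);

            byte[] body = new byte[1024]; // ~1 KB per message
            for (int i = 0; i < 5_000; i++) {
                channel.basicPublish("", "dead-letter-drop", null, body);
            }
        } finally {
            channel.close();
            conn.close();
        }
    }
}
```

Attach a consumer to the same queue and you can watch the broker sustain that rate on very little memory.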
Now I'm a pretty great cynic when it comes to technology. I don't drink the kool-aid very easily (I never got into Ruby on Rails because I don't believe Active Record solves all my problems), and when I do, I don't drink much. But Rabbit got me thinking. With REST (some kool-aid, but not too much, since in my mind it's autumn when it comes to web communication) and functional programming, I think there could be a great opportunity for scaling. Couple this with Backbone.js or Ember.js and a whole new world of fast, user-friendly enterprise and consumer apps opens up.
As a result of my mild dreams, I'm looking at Erlang. Rabbit does well with it. It confines side effects to its actors' internal state. It doesn't scare me as much as Haskell (I'm looking at you, Haskell fans on HN; damn, Haskell fans, you scary!). It doesn't run on the JVM and so can't bring in old, busted JARs! All told, I think this will bring a pretty good opportunity my way.
Time will tell, but that's my reason for looking beyond Scala to something more functional. Please let me know your thoughts.



Thanks,
JPD