Wednesday, April 2, 2014

Getting A PDF To Work Well On A Kobo

The mournful prelude

Over the past few years I've amassed a collection of PDFs on various topics. They were stored with the best of intentions. "Surely one day I will read this", I told myself. Sadly, I never got around to actually reading them.

I have an iPad, but I dislike reading PDFs on it. Something about it doesn't mesh. I do read PDFs I've printed, but printing doesn't scale, either in cost or logistically.

I tried an eBook reader, a first generation Kobo. It worked pretty well for the one book I bought from Kobo. It failed miserably to work with PDFs. Zooming in and out on a slow e-ink screen was too painful. Even the new ones look like they're pretty poor with PDFs (https://www.youtube.com/watch?v=wkWVaPw3Fgs).

The problem is not the reader. The problem is the format. PDFs are not intended to flow naturally like HTML. They are rigid documents intended to reproduce an exact copy on the paper size for which they were designed.

My problem is compounded by the fact that most of the PDFs I want to read are two-column. That makes them nearly impossible to view on a little 6" screen.

As a result, I left the PDFs to rot on Dropbox's servers. That is until the past few nights.

Revising my tool set

Recently my life changed. This gave me more time, and more need, to read my PDFs than ever before. I now live in Florida. I'm also self-publishing a book on NoSQL (New Data For Managers). I should be able to research the content while sitting on a beach just 45 minutes from my house. A beach means the iPad is out.

After a few hours of research on PDF-to-EPUB conversion, I found that Calibre is the best open source option. I used it back in the 0.4 or 0.5 days. It worked then for getting non-purchased books onto the Kobo. Great for adding Project Gutenberg stuff. Terrible at adding PDF-derived works. I gave it another chance though.

It's still fairly crummy at getting PDFs into a usable EPUB. Footers are included and show up at random spots. Headers are the same. Page numbers from the PDF footers appear at random throughout the text. Really distracting. Often they make the text unreadable.

To compound the issue, the resulting EPUB might have just "2" pages: the cover and one really long page that holds all of the PDF content. If this happens, it's impossible to jump to page "40"; instead you have to page forward on the device 40 times.

Turns out that there is a command line tool for converting 2-column layouts into normal page layouts. It can also crop the footers and headers off. Since I'm not a huge fan of fiddling with PDF page dimensions through command line flags, I looked to see if someone had fronted it with a GUI.

Turns out that there is such a GUI. Using it and the command line tool made it possible to convert a journal-formatted PDF into a readable EPUB. While the output is not perfect (in my experience an occasional last line gets cropped off at random), it is highly readable. The document is easy to load and read. So I'm probably off to the beach, since I have to go to a bigger city to roll over my 401(k) anyway.

How to make the magic happen

  1. Download and install the following tools.
    Calibre - http://calibre-ebook.com/ (I found the website to be ahead of Linux Mint's repo).
    k2pdfopt - http://www.willus.com/k2pdfopt/
    journal2ebook - https://github.com/adasilva/journal2ebook

    You need to add k2pdfopt to your path. This enables journal2ebook to see it.
  2. Take a complex PDF like Amazon's Paper on Dynamo (http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) and open it with journal2ebook.py.
  3. Crop the PDF and click the "Ready!" button in the lower right. It will ask you where to save the PDF. This step reflows the two-column layout into a single-column page format.
  4. Open Calibre and add the document.
  5. Right click on the PDF and select Convert. In the wizard choose EPUB as the output format. Don't bother with the rest of the settings yet, since most likely you will get no text from the PDF, just a bunch of images. Finalize the conversion.
  6. On the book, right click again, but this time choose "Edit Book". Here you'll get to see inside the EPUB. There should be a file called index*.html in the tree to the left. Open that. Then right click on a <p> and choose "Split at multiple locations". This will open a wizard. There is an option to enter a tag; simply type "p". This tells Calibre to turn every <p> into its own page.
  7. Now open the file "content.opf". Here you want to find the entries for the original index.html file. Delete both its entry under <manifest> and its entry under <spine>. This will make the really big page disappear from the reader's view (you can also delete the index file itself).
  8. Finally save the EPUB changes under the File menu.

Non-happy path

You might have a really large document that produces more than one index*.html file. You'll have to eyeball the output to see exactly what you have to change, but the general idea is the same.

Enjoy

At this point you should be able to push the file to your reader (Kobo or otherwise). Most of the page should be visible. Occasionally the image might be a touch too large for the reader.

The nice thing is I can read at the beach now and don't have to spend any money. This is great since I'm bootstrapping my own consulting/product firm. Money in the pocket means food on the table. Until next time, thoughtful reading.

PS

One thing to keep in mind is that you can't increase the font size, since you're dealing with pictures of text. If you need bigger text, you should be able to recreate the EPUB using the steps above, but play with some of the settings along the way.

Tuesday, February 18, 2014

Getting stung by the Hive

On my previous project, I got to work with Hive. To quickly introduce the tool, it's an Apache-sponsored SQL façade over Hadoop. Being Java-based, it has a fledgling JDBC driver. Being "scalable", it has a few Thrift services. Being POSIX-based, it has a command line interface. My task was figuring out how to ETL raw ANSI COBOL files into wide Hive tables. The following post discusses where I was stung and the ointment I found. Unlike Thomas in "My Girl"[0], I survived the Hive and got the system into production.

One note: I was using Hive 0.11, so perhaps 0.12 will bring more to the table.

Warning from the drones

This raised flags with me, but I had to soldier through. Hive has a wiki. It seems to have several, in fact. The Confluence one is authoritative, but not necessarily the highest ranked in Google. In my experience the documentation is at best incomplete. There doesn't seem to be a large community around Hive; at least it's not as easy to find as Cascading's. So that was a glancing sting. If you've got the money, I recommend getting the Hive book. If you don't, or prefer not to pay for something with YARN staring you in the face, well, hopefully more blog posts like this come along.

First blood -- Interacting with Hive

There are three major ways to talk with Hive. The recommended way is via JDBC. For those who enjoy a more raw client/server process, there are Hive Servers 1 and 2. Finally there are two CLIs: hive and beeline. Your needs will influence which interaction model, or combination of them, you choose.

Presently the SQL understood by Hive's JDBC driver is query focused. You can name databases in a query, such as "SELECT * FROM MY_DB.SOME_TABLE LIMIT 5;" It is not DDL focused. You cannot say "ALTER TABLE MY_DB.SOME_TABLE ..." because the ALTER statement does not allow a database qualifier. If you need to use a particular database and you want to use the JDBC driver, you have to specify the database in the connection URL passed to the driver at construction time.

How I was stung: my system allows the user to override the database name at runtime for DDL statements. You cannot (currently) call setCatalog on the JDBC driver and effect a change[1]. The request to get this ability is old, and apparently abandoned[2]. It is also not possible to leverage the "USE <<DATABASE>>" statement in the DDL; the parser will fail. I tried a few ways around this, such as sending the USE statement and the DDL together in one call; it did not work. If you need to execute DDL against a database other than the one specified on the connection URL, JDBC is not presently your friend.
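
For illustration, here is a minimal sketch of the JDBC path with the database baked into the URL. The driver class and URL shape are the standard HiveServer2 ones; the host, port, database, and table names are made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // The database has to be chosen here; setCatalog() has no effect and
        // "USE my_db" will not get past the parser on this driver.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/my_db", "etl_user", "");
        try {
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT * FROM some_table LIMIT 5");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        } finally {
            conn.close();
        }
    }
}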

With JDBC a bust for my daily needs, I switched focus to the services. Hive Server 1 is a deprecated attempt at services; Hive Server 2 is the replacement and supports multiple client executions. Either would be fine.

How I was stung: I could not find any meaningful examples of how to use either server for DDL. When sent DDL such as "CREATE TABLE MY_DATABASE.SOME_TABLE (COL1 STRING)", the server would return a failure message because "STRING" was an unknown type. To be fair, I had never used a Thrift client before, but in my opinion the Java client should be one of the best-documented ones out there.

My solution was to fall back to the command line interface from the Java code[3]. The hive command is visible to the application account. It also allows you to invoke a script using -f. Finally, scripts do support USE <<DATABASE>>. Putting those together, we can generate a script like this to any necessary complexity.

USE ${db};

ALTER TABLE ${table} ${columns};
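
A minimal sketch of that fallback from Java, assuming the hive binary is on the application account's PATH; the database, table, and column names are placeholders:

import java.io.File;
import java.io.FileWriter;

public class HiveScriptRunner {

    // Writes the statements to a temporary .hql file and shells out to `hive -f`.
    public static int run(String... statements) throws Exception {
        File script = File.createTempFile("hive-ddl-", ".hql");
        FileWriter out = new FileWriter(script);
        try {
            for (String statement : statements) {
                out.write(statement);
                out.write('\n');
            }
        } finally {
            out.close();
        }
        Process hive = new ProcessBuilder("hive", "-f", script.getAbsolutePath())
                .inheritIO()   // surface Hive's console output in our own logs
                .start();
        return hive.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // USE works fine inside a script, unlike through the JDBC driver.
        int exit = run(
                "USE my_db;",
                "ALTER TABLE some_table ADD COLUMNS (new_col STRING);");
        System.out.println("hive exited with " + exit);
    }
}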

Second wave -- LOAD does not mean what you think it means

Let me start this section by taking full blame for my issues. The documentation clearly says, "Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables."[4] This little section is easy to miss in the heady days when you're first working to get Hive up and loading under a tight deadline.

As it says, Hive JUST MOVES FILES! If you are storing the data as simple text, you're golden; a move happens to work. If you use any other kind of SerDe, you're in for a surprise.

How I was stung: LOADing text into an ORC backed table will corrupt the table. Data will "load" just fine. But when you run a query you'll get runtime conversion issues that prevent getting any results.

Getting around this requires double-inserting the data. First, create a clone of the target table but with a basic LazySerDe. Second, use "LOAD" to move data into the temp table. Finally, use "INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM temp_table;". The INSERT statement causes the SerDe that backs the final table to convert the rows from the SELECT into its own storage format.

Now a short note about the performance ramifications of the double load. You will incur at least one mapper job. In my experience, loading an ORC-backed table in this manner results in 3 job executions per load. Play around with the backing SerDe to balance load time against query execution.

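Here's a sketch of the full sequence, with hypothetical table and column names; the statements can be fed through the same `hive -f` fallback described earlier:

import java.util.Arrays;
import java.util.List;

public class OrcDoubleLoad {

    // Builds the statement sequence for landing a delimited text file into an
    // ORC-backed table named `events`. Hand the result to whatever execution
    // path you use (for example the `hive -f` runner shown earlier).
    public static List<String> statementsFor(String hdfsTextFile) {
        return Arrays.asList(
            // 1. A throwaway staging table using the plain text SerDe, so that
            //    LOAD's bare file move produces readable rows.
            "CREATE TABLE IF NOT EXISTS events_staging (id STRING, payload STRING) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t';",
            // 2. LOAD just moves the file into the staging table's directory.
            "LOAD DATA INPATH '" + hdfsTextFile + "' INTO TABLE events_staging;",
            // 3. The INSERT ... SELECT is what forces the ORC SerDe behind the
            //    real table to rewrite the rows into its own storage format.
            "INSERT INTO TABLE events SELECT * FROM events_staging;",
            // 4. Clear the staging table so the next load starts empty.
            "DROP TABLE events_staging;");
    }

    public static void main(String[] args) {
        for (String statement : statementsFor("/landing/events/2014-02-18.txt")) {
            System.out.println(statement);
        }
    }
}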

Third wave -- Delete? What delete?

Hive is write-forward only. This makes sense since it sits on HDFS, which is write-forward only as well. As a result, deleting one or a few rows requires selecting what should survive into a temp table, deleting the old table, and then renaming the temp table to the original name.

How I was stung: I had to delete a load file's worth of data from the table because something in it could have been wrong. Because we had to seamlessly support analytic work while loading data, the full filter/replace process above wouldn't work.

There are two possible solutions for this. The first is to back the table with HBase. This makes it possible to delete based on a query. If you choose this option you will have to run the delete at the HBase level, which means that Hive is not as IT/admin-free as it could be. Also, HBase is not as fast as natively backed Hive tables; in our investigation spike it was about 25% slower than a table backed by Hive's lazy SerDe.

The second option leverages the only native delete-like feature in Hive: partitions. Hive partitions are merely an abstraction over how the data is stored. Let's say your table is the classic page_view example, partitioned by a country column.

There will be an additional set of directories under /page_view with 'country=us', for example. When you INSERT into a table with OVERWRITE and PARTITIONED data, you are telling Hive to delete the partition directory if it exists and load the new data into it.
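
As a sketch, a partition-level "delete" then looks something like this (the staging table and column names are made up, and the statement can go through whichever execution path you prefer):

public class PartitionReplace {

    // Builds the INSERT OVERWRITE that replaces everything currently sitting
    // in the given country partition with whatever the SELECT produces.
    public static String replaceCountry(String country) {
        return "INSERT OVERWRITE TABLE page_view PARTITION (country='" + country + "') "
             + "SELECT view_time, user_id, page_url "
             + "FROM page_view_staging WHERE country='" + country + "';";
    }

    public static void main(String[] args) {
        // Hive drops the old country=us directory and writes the fresh rows in its place.
        System.out.println(replaceCountry("us"));
    }
}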

Watch out with partitions. If your partitions are too small, you're going to have a bad time. Remember, you want each partition to hold at least a block's worth of data in order to get the biggest bang for your buck.

Reflecting back on it, I think Hive's biggest issue is its maturity. Missing features, missing abstractions and missing documentation all come with being young. As it stands, it should be a good tool for querying. Until it matures a bit, IT/DevOps will have to play an active role in maintaining the data on a day-to-day basis.

References:

0 - http://www.imdb.com/title/tt0102492/

1 - http://svn.apache.org/viewvc/hive/trunk/jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java?view=markup

2 - https://issues.apache.org/jira/browse/HIVE-2320

3 - http://docs.oracle.com/javase/7/docs/api/java/lang/Process.html

4 - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables

Monday, February 3, 2014

OpenDNS on an ASUS RT-N66R

For anyone who's got this piece of equipment, you might have looked around and found no place to configure an alternative DNS. I searched the Internet and did not get much help.

Turns out the setting is right there, just a little obfuscated. Log in and go to WAN -> WAN DNS Settings. Select "No" as your answer to "Connect to DNS Server automatically". This will reveal two new input boxes. Drop in the OpenDNS settings. Click Apply. Run an update on the OpenDNS updater.

Not really hard, but it wasn't like other routers where the boxes are always there.

Monday, October 14, 2013

Dining alfresco

My brother-in-law is a budding writer. After talking with him, I think life would be better if I took a few minutes out of my week to write short sketches of fiction. Hopefully they will teach me about writing and in turn show him support.

Short - Dining Alfresco.

What occurred before him was the single most peculiar thing Ralph had ever seen. A small group of frog-like humanoids were fighting three colorful praying-mantis-like creatures. Some might have called it a dream. Ralph wouldn't. Most of his dreams were about sex or his mother, and the odd dream of sex with his mother. The vision of the tumult in front of him was more horrifying than anything he had ever dreamed about.

The frogs attacked with tools and magic. Behind the fray stood a lone shaman casting what looked like force lightning at the buggies. Each bug would temporarily halt its advance with the shock, but it didn't deter them. On the line the frogs struck with machete-like instruments, their scant clothing waving in the wind as they lunged at their foes. Exoskeletons cracked. The bugs pressed on.

In the center, and slightly behind the other two, stood the largest bug. By its behavior it acted like a queen. It directed the others with shrill cries. Occasionally, as the frog line advanced, it, being the largest, would strike from above at the heads or torsos of the frogmen. Each time it would strike a blow. Sometimes a deadly blow, as a frog dropped to the ground oozing blood from its smashed skull.

After a few minutes the front line of the frogmen was no more. The pack of bugs wheeled around to kill the outmatched shaman in the rear. Realizing its fate, the shaman added even more peculiarity to the situation by simply clapping its hands together while making an odd, low grumble.

Snicker-snack sank the queen's tibia through the brain pan of the shaman. It smoothly slid through the head and out the jaw. With the same quickness of its entry, the tibia recoiled out. The shaman dropped, limp and bloodied, to the floor.


The bugs walked up the steps the frogs were trying so hard to defend. The egg sacs that kept their young were now ripe for the taking. Ralph left them to their meal. Lord knows if it was calorically positive for the bugs after the fight. Perhaps they would dine on the frog corpses like the French to make up any deficit.

Saturday, May 4, 2013

Rafting With the Elephant: Cascading and Configuration Of Steps

At work we've been having a hard time trying to efficiently leverage our Hadoop cluster. One major issue with our project is that we must ensure that no data duplication occurs at any point in our processing. Such checks necessitate the use of reducers.

The problem with reducers is that they really shouldn't process more than 10 GB [1]. In our case the system was schlepping about 100 GB of data from the various nodes to a single reducer, since one reducer is Cascading's default setting. This made the processing arduous. However, there are parts where we need all of the data to go to just one reducer (statistical calculations). As a result, we are unable to set the job config to, say, 4 reducers at a macro level.

This issue is compounded by the fact that we are currently using Cascading 2.0.8. This version has a bug where Pipe.getStepConfigDef does nothing, even on a GroupBy or, in our case, the GroupBy within a Unique. Chris Wensel of Cascading fame has acknowledged the bug [2]. Frankly, I can't get too mad at this for two reasons: 1) Cascading has to be hands down the best open source project I've ever used, and 2) we're on an unsupported version of the framework, so Chris' "too bad for you" [3] statement makes sense and they're working on it.

But bugs aside, I have a real problem; you might have the same problem. How does one, like me, fix this (especially when the flow is taking 16 hours of a 24-hour window that resets every 16 hours)? The answer lies in the magic of FlowStepStrategy. This interface gives us developers the opportunity to set configurations before steps are sent out the door to the cluster. In our project we had been using it all along, but improperly. As a result it took me a while to figure out how to leverage this great feature.

Basically, the key to the trick I'm about to describe is the flowStep parameter of the apply method. This parameter has the method containsPipeNamed. By using it along with the various dot files produced by the flow, you can figure out which step your pipes are firing in. Armed with this information, you can implement the interface to look for pipes with certain names. When you hit a pipe (preferably a named GroupBy pipe), you can customize the job configuration, as sketched below. In our case we look for a GroupBy pipe with a certain name. When hit, we set "mapred.reduce.tasks" to 4. This dramatically decreases the amount of time spent on reducing, but allows us to maintain the default of 1 reducer down the line. Of course you could add other configurations here as well.
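
Here is a minimal sketch of such a strategy. I'm writing it from memory of the Cascading 2.x Hadoop API (FlowStepStrategy parameterized on JobConf), so double-check the exact method names against your version; the pipe name is made up:

import java.util.List;

import org.apache.hadoop.mapred.JobConf;

import cascading.flow.Flow;
import cascading.flow.FlowStep;
import cascading.flow.FlowStepStrategy;

// Bumps the reducer count only for the step containing our named GroupBy,
// leaving every other step at Cascading's default of a single reducer.
// Registered on the flow with something like:
//   flow.setFlowStepStrategy(new NamedGroupByReducers("dedupe-group", 4));
public class NamedGroupByReducers implements FlowStepStrategy<JobConf> {

    private final String pipeName;
    private final int reducers;

    public NamedGroupByReducers(String pipeName, int reducers) {
        this.pipeName = pipeName;
        this.reducers = reducers;
    }

    @Override
    public void apply(Flow<JobConf> flow,
                      List<FlowStep<JobConf>> predecessorSteps,
                      FlowStep<JobConf> flowStep) {
        if (flowStep.containsPipeNamed(pipeName)) {
            flowStep.getConfig().set("mapred.reduce.tasks", String.valueOf(reducers));
        }
    }
}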

Once you've implemented that interface you just add your special class to the flow. Then POW! things move faster.

Hopefully this will help others facing the same issue. I hope that the problem is solved in the future.




Friday, July 27, 2012

I just test, no qualifiers.

About two weeks ago I got into a debate with a teammate about unit testing Hadoop Pig scripts. My colleague's view was that only the UDFs utilized by the scripts should be unit tested. No one should use PigUnit to unit test a script. He thought the whole idea was silly. He drew an equivalence to attempting to unit test a SQL call. Would anyone ever be silly enough to actually test a select statement? My answer is "I do!"

The debate was over a mix of semantics and philosophy. He was getting caught up in the use of the word "unit" when I said I wanted to unit test. To him, a Pig script, or apparently a SQL statement, is a non-unit-testable component because it requires integration with some supporting system. One cannot easily mock Pig (you can mock a database, but there is some debate as to the necessity of such actions). In his mind, having a direct dependency on Pig made the test an integration test and, as such, something that should not require regression tests or even correctness tests.

I wanted to have a repeatable set of tests that proved that each major component of the script was doing its job. Our scripts are not the simple scripts that are tossed about as examples in various blogs. Our scripts often require 10+ macro definitions, each with 5+ lines of Pig, often requiring UDFs. To not test such code is negligence. Our entire system requires these scripts to rip through gigs of data looking for complex patterns and finding fraud. We have to know, before deploying the code to even the developers' cluster, that the scripts work as expected over a variety of use cases.
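
To make that concrete, here is roughly what such a test looks like with PigUnit; the script path, aliases, and data are invented for the example:

import org.apache.pig.pigunit.PigTest;
import org.junit.Test;

public class FraudPatternsScriptTest {

    @Test
    public void countsDuplicateClaims() throws Exception {
        // Parameter values don't matter much; PigTest overrides the script's
        // LOAD and the asserted alias with the data given below.
        String[] params = { "input=ignored.txt", "output=ignored_dir" };
        PigTest test = new PigTest("src/main/pig/fraud_patterns.pig", params);

        String[] input = {
            "claim-1\t2012-07-01\t100.00",
            "claim-1\t2012-07-01\t100.00",   // deliberate duplicate
            "claim-2\t2012-07-02\t250.00",
        };
        String[] expected = { "(claim-1,2)" };

        // "claims" is the alias the script LOADs into; "duplicate_claims" is
        // the alias whose output we care about.
        test.assertOutput("claims", input, "duplicate_claims", expected);
    }
}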

As a result of this discussion, I've come to the conclusion that no one should "unit test". They should instead just test. Qualifying the test type just opens the conversation up to religious debate. The goal should be test coverage of at least 80% of the code base, done in such a way as to isolate the code that you really care about when you test.

Looking at the testing problem this way might be a substantial change to how one develops. For example, I don't black-box test my code. I have a set of tests that check conformance to the contracts, but I also test the underlying code to make sure that its implementation-specific features are doing what they should. If I'm testing a DAO, I make the test transactional so I can insert or delete data to check that the DAO's data mappings are working. Is this a unit test? No, probably not. Should it be part of the normal build? Absolutely! If you don't have such tests as mainline tests, you could be shipping a product that will fail once deployed.
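
With Spring's test support, that kind of DAO test looks roughly like the sketch below; CustomerDao and Customer are stand-ins for whatever your persistence layer actually holds:

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.transaction.annotation.Transactional;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration("classpath:test-context.xml")
@Transactional   // each test's inserts are rolled back after it finishes
public class CustomerDaoTest {

    @Autowired
    private CustomerDao customerDao;   // hypothetical DAO under test

    @Test
    public void savedCustomerMapsBackCorrectly() {
        Customer saved = customerDao.save(new Customer("Ralph", "FL"));
        Customer found = customerDao.findById(saved.getId());

        assertEquals("Ralph", found.getName());
        assertEquals("FL", found.getState());
    }
}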

Approaching the problem from this perspective will improve your code quality. It doesn't add much time to the development cycle. It might even save time if you are in an IBM shop where WAS takes 2-3 minutes to boot, let alone the added time to navigate the system and get to your change. This approach works well with both TDD and BDD concepts.

There are two points when you should start wondering if you are testing properly. The first is when your builds start taking more than a few minutes to complete as a result of the testing phase. This might mean you have too many small tests that could be consolidated. It might mean you've got a test that is acting as an integration test when it should be refactored to use mocks and isolate the components. The second is when you have a component with high coupling to external modules. If you cannot use DI to slip in mocks for the other parts, you probably aren't testing well and you probably aren't designing well. Inversion of control will help break your problem down into smaller, testable, bite-size parts.
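
As a tiny illustration of slipping a mock in through DI (the class names are invented; Mockito supplies the stub):

import static org.junit.Assert.assertTrue;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class FraudScorerTest {

    // The collaborator is an interface handed in through the constructor,
    // so the test never needs a live rules service.
    interface RulesService {
        int score(String claimId);
    }

    static class FraudScorer {
        private final RulesService rules;

        FraudScorer(RulesService rules) {
            this.rules = rules;
        }

        boolean isSuspicious(String claimId) {
            return rules.score(claimId) > 75;
        }
    }

    @Test
    public void highScoresAreSuspicious() {
        RulesService rules = mock(RulesService.class);
        when(rules.score("claim-1")).thenReturn(90);

        assertTrue(new FraudScorer(rules).isSuspicious("claim-1"));
    }
}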

Avoiding labels is not always possible. But even when you can't, please remember that the label is just a concept to help you frame the problem. When you start fighting over the label's font size, so to speak, and not solving the problem, you've got to find your way back to the path and carry on.

Tuesday, May 22, 2012

Why I'm looking at Erlang over Scala

For the past few months I've been trying to learn a bit about Scala. It is a complex language full of nuance. How a Map can function like a Perl associative array is a very interesting feat of reflection, or compile-time interface addition. But behind all of its complexity, something dark has been nagging at me. I wasn't able to place my concern until I used RabbitMQ from Java. But now I think I'll move on past the language to something else.
The pebble stuck in my craw is the dual nature of Scala: object-oriented and functional. Like most dualisms, I have a hard time understanding it. But beyond that, I think the dualism, plus Scala's desire to leverage existing Java libraries, is inherently problematic from a daily development perspective.
The dualism is troublesome because it gives a safe veneer of functional to a seedy underbelly of mutable objects. Sure this function is passable, says Scala, but inside there be side effects. This is even more troubling when one couples older, non-thread-safe Java code to a functional-looking Scala snippet. Now we've lost our scalability. We have to worry about state because our old Java 1.4 library is unsafe. But since we're thinking in Scala, this old way of life is possibly, probably, lost on us. Now we get to deal with random threading issues like in days of yore.
This is the threat to a Scala project: old busted Java getting coupled somewhere along the line with our nice, new pseudo-functional system. Our vals vs. vars are undone by one call to some open source project that didn't worry enough about threading to even warn the user that an object is not thread safe. All it takes is one rogue developer in the project to totally undo the whole thing (and we probably all know that one guy; the guy who adds a random JAR to the project that's not really compatible with the version of some other JAR already in the project, and you don't find out till runtime, three days after going to prod).
As a result of my choking on this problem, I've looked around for alternatives. Clojure looked good, except that, even as a Lisp, it suffers from the same cross-compatibility with Java that Scala provides. Side effects are possible, even when I don't mean them.
Fortunately, I've been working with RabbitMQ for the past month. As part of my trials I sent random messages to dead-letter drops to test performance. With about 24 MB of memory I can send 5K messages a second (about 1 KB per message) to the queue. When I attach a listener to the other end, I can sustain that kind of throughput on the meager 256 MB Oracle Linux VM I've got hosting RabbitMQ. This wonderful little broker is written in Erlang.
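
For context, a stripped-down version of that throughput probe looks something like the following; the host and queue name are placeholders, and the numbers will obviously vary:

import java.util.Random;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ThroughputProbe {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit-vm");                       // placeholder host

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        try {
            // A durable queue with no consumer attached acts as the dead-letter drop.
            channel.queueDeclare("dead.letter.drop", true, false, false, null);

            byte[] payload = new byte[1024];                // roughly 1 KB per message
            new Random().nextBytes(payload);

            int count = 50000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < count; i++) {
                channel.basicPublish("", "dead.letter.drop", null, payload);
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.printf("%d messages in %d ms (%.0f msg/s)%n",
                    count, elapsed, count * 1000.0 / elapsed);
        } finally {
            channel.close();
            connection.close();
        }
    }
}
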
Now I'm a pretty great cynic when it comes to technology. I don't drink the kool-aid very easily (I never got into Ruby on Rails because I don't believe that Active Record solves all my problems), and when I do, I don't drink much. But Rabbit got me thinking. With REST (some kool-aid, but not too much, since in my mind it's autumn when it comes to web communication) and functional programming, I think there could be a great opportunity for scaling. Couple this with Backbone.js or Ember.js and a whole new world of fast, user-friendly enterprise and consumer apps opens up.
As a result of my mild dreams I'm looking at Erlang. Rabbit does well with it. It minimizes side effects to its actors' internal state. It doesn't scare me as much as Haskell (I'm looking at you, Haskell fans on HN; damn, Haskell fans, you scary!). It doesn't run on the JVM and so can't bring in old, busted JARs! All told, I think this will bring a pretty good opportunity my way.
Time will tell, but that's my reason for looking beyond Scala to something more functional. Please let me know your thoughts.



Thanks,
JPD