Wednesday, April 2, 2014

Getting A PDF To Work Well On A Kobo

The mournful prelude

Over the past few years I've amassed a collection of PDFs on various topics. They were stored with the best of intentions. "Surely one day I will read this", I told myself. Sadly, I never got around to actually reading them.

I have an iPad, but I dislike reading PDFs on it. Something about it doesn't mesh. I do read PDFs I've printed, but printing doesn't scale, either in cost or logistically.

I tried an eBook reader, a first-generation Kobo. It worked pretty well for the one book I bought from Kobo, but it failed miserably with PDFs. Zooming in and out on a slow e-ink screen was too painful. Even the new models look like they're pretty poor with PDFs (https://www.youtube.com/watch?v=wkWVaPw3Fgs).

The problem is not the reader. The problem is the format. PDFs are not meant to flow naturally like HTML. A PDF is a rigid document intended to reproduce an exact copy of the page it was laid out for.

My problem is compounded by the fact that most of the PDFs I want to read are two-column. That makes them nearly impossible to view on a little 6" screen.

As a result, I left the PDFs to rot on Dropbox's servers. That is, until the past few nights.

Revising my tool set

Recently my life changed, giving me more time, and more need, to read my PDFs than ever before. I now live in Florida, and I'm self-publishing a book on NoSQL (New Data For Managers). I should be able to research the content while sitting on a beach just 45 minutes from my house. A beach means the iPad is out.

After a few hours of research on PDF-to-EPUB conversion, I found that Calibre is the best open source option. I used it back in the 0.4 or 0.5 days. It worked then for getting non-purchased books onto the Kobo: great for adding Project Gutenberg material, terrible for PDF-derived works. I gave it another chance though.

It's still fairly crummy at turning PDFs into a usable EPUB. Footers are included and show up at random spots; headers are the same. Random pagination from the PDF footers occurs throughout the text. It's really distracting, and often it makes the text unreadable.

To compound the issue, the resulting EPUB might have just "2" pages: the cover and one really long page holding all of the PDF content. If this happens, it's impossible to jump to page "40"; instead you have to page forward on the device 40 times.

It turns out that there is a command line tool for reflowing 2 column layouts into normal single-column pages. It can also crop the footers and headers off. Since I'm not a huge fan of command line tools that take PDF dimensions as arguments, I looked to see if someone had fronted it with a GUI.

It turns out that there is such a GUI. Using it together with the command line tool made it possible to convert a journal-formatted PDF into a readable EPUB. While the output is not perfect (in my experience an occasional last line gets cropped off at random), it is highly readable, and the document is easy to load and read. So I'm probably off to the beach, since I have to go to a bigger city to roll over my 401(k) anyway.

How to make the magic happen

  1. Download and install the following tools.
    Calibre - http://calibre-ebook.com/ (I found the build on the website to be ahead of Linux Mint's repo).
    k2pdfopt - http://www.willus.com/k2pdfopt/
    journal2ebook - https://github.com/adasilva/journal2ebook

    You need to add k2pdfopt to your PATH so that journal2ebook can see it.
  2. Take a complex PDF like Amazon's Paper on Dynamo (http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
    and open it with journal2ebook.py. 
  3. Crop the PDF and click the "Ready!" button in the lower right. It will ask you where to save the new PDF. This step reflows the 2 column layout into a single-column format.
  4. Open Calibre and add the document.
  5. Right click on the PDF and select Convert. In the wizard, choose EPUB as the output format. Don't bother with the rest of the settings yet, since most likely you will get no text from the PDF, just a bunch of images. Finalize the conversion.
  6. On the book, right click again, but this time choose "Edit Book". Here you get to see inside the EPUB. There should be a file called index*.html in the tree on the left. Open it. Then right click on a <p> and choose "Split at multiple locations". This opens a wizard with an option to enter a tag; simply type "p". That tells Calibre to turn every <p> into its own page.
  7. Now open the file "content.opf". Find the entry for the original index.html file and delete it both under <manifest> and under <spine> (see the excerpt just after this list). This will make the really big page disappear from the reader's view (you can also delete the index file itself).
  8. Finally save the EPUB changes under the File menu.
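
For reference, the two content.opf entries to remove look roughly like this; the id and href values here are only examples, so match them against whatever Calibre generated in your file:

    <manifest>
      <!-- delete the item whose href points at the original index.html -->
      <item id="index" href="index.html" media-type="application/xhtml+xml"/>
      ...
    </manifest>
    <spine>
      <!-- delete the matching itemref as well -->
      <itemref idref="index"/>
      ...
    </spine>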

Non-happy path

You might have a really large document that produces more than one index*.html file. In that case you will have to eyeball the output to see exactly what you have to change, but the general idea is the same.

Enjoy

At this point you should be able to push the file to your reader (Kobo or otherwise). Most of each page should be visible; occasionally the image might be a touch too large for the reader.

The nice thing is I can read at the beach now and don't have to spend any money. This is great since I'm bootstrapping my own consulting/product firm. Money in the pocket means food on the table. Until next time, thoughtful reading.

PS

One thing to keep in mind is that you can't increase the font size, since you're dealing with pictures of the text rather than text itself. If it comes out too small, you should be able to recreate the EPUB using the steps above while playing with some of the settings.

Tuesday, February 18, 2014

Getting stung by the Hive

On my previous project, I got to work with Hive. To quickly introduce the tool: it's an Apache-sponsored SQL façade over Hadoop. Being Java based, it has a fledgling JDBC driver. Being "scalable", it has a few Thrift services. Being POSIX based, it has a command line interface. My task was figuring out how to ETL raw ANSI COBOL files into wide Hive tables. The following post discusses where I was stung and the ointment I found. Unlike Thomas in "My Girl"[0], I survived the hive and got the system into production.

One note: I was using Hive 0.11, so perhaps 0.12 will bring more to the table.

Warning from the drones

Hive has a wiki. Actually, it seems to have several; the Confluence one is authoritative, but not necessarily the highest ranked in Google. In my experience the documentation is incomplete at best, and there doesn't seem to be a large community around Hive, or at least it's not as easy to find as Cascading's. This raised flags with me, but I had to soldier through. So that was a glancing sting. If you've got the money, I recommend getting the Hive book. If you don't, or prefer not to pay for something with YARN staring you in the face, well, hopefully more blog posts like this come along.

First blood -- Interacting with Hive

There are three major ways to talk to Hive. The recommended one is via JDBC. For those who enjoy a more raw client/server process, there are Hive Servers I and II. Finally there are two CLIs: hive and beeswax. Your needs will influence which interaction model, or combination of them, you choose.

Presently, the SQL understood by Hive's JDBC driver is query focused. You can name databases in a query, such as "SELECT * FROM MY_DB.SOME_TABLE LIMIT 5;". It is not DDL focused: you cannot say "ALTER TABLE MY_DB.SOME_TABLE ...", because the ALTER statement does not allow a database qualifier. If you need to target a particular database and you want to use the JDBC driver, you have to specify the database in the connection URL passed to the driver at construction time (for HiveServer2 that looks something like jdbc:hive2://host:10000/my_db).

How I was stung: my system allows the user to override the database name at runtime for DDL statements. You cannot (currently) call setCatalog on the JDBC driver and effect a change[1]. The request for this ability is old, and apparently abandoned[2]. It is also not possible to lean on the "USE <<DATABASE>>" statement in the DDL; the parser will fail. I tried a few ways around this, such as "USE DATABASE && DDL"; none worked. If you need to execute DDL against a database other than the one specified on the connection URL, JDBC is not presently your friend.

With JDBC a bust for my daily needs, I switched focus to the services. Hive Server 1 is a deprecated attempt at services; Hive Server 2 is the replacement and supports multiple client connections. Either would have been fine.

How I was stung: I could not find any meaningful examples of how to use either server for DDL. When sent DDL such as "CREATE TABLE MY_DATABASE.SOME_TABLE (COL1 STRING)", the server would return a failure message because "STRING" was an unknown type. To be fair, I'd never used a Thrift client before, but in my opinion the Java client should be one of the best documented versions out there.

My solution was to fall back to the command line interface from the Java code[3]. The hive command is visible to the application account, it allows you to invoke a script using -f, and scripts do support USE <<DATABASE>>. Putting those pieces together, we can generate a script like the following to any necessary complexity.

USE ${db};
ALTER TABLE ${table} ${columns};
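
With the placeholders filled in, the generated script might look like this (the database, table and column names here are made up for illustration); it then runs through the CLI with hive -f path/to/script.hql:

USE sales_mart;
ALTER TABLE daily_orders ADD COLUMNS (discount_code STRING);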

Second wave -- LOAD does not mean what you think it means

Let me start this section by taking full blame for my issues. The documentation clearly says, "Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables."[4] This little section is easy to miss in the heady days when you're first working to get Hive up and loading under a tight deadline.

As it says, Hive JUST MOVES FILES! If you are storing the data as simple text, you're golden; a move happens to work. If you use any other SerDe, you're in for a surprise.

How I was stung: LOADing text into an ORC-backed table will corrupt the table. The data will "load" just fine, but when you run a query you'll get runtime conversion errors that prevent getting any results.

Getting around this requires inserting the data twice. First, create a clone of the target table backed by a basic LazySerDe. Second, use LOAD to move the data into that temp table. Finally, use "INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM temp_table;". The INSERT statement causes the SerDe that backs the final table to convert the rows from the SELECT into its own storage format.

Now, a short note about the performance ramifications of the double load: you will incur at least one mapper job. In my experience, loading an ORC-backed table this way results in 3 job executions per load. Play around with the backing SerDe to balance load cost against query execution.

Here's the gist of the flow.
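
A sketch of the double insert, with made-up table names, columns and paths (adjust the schema, SerDe and partition spec to match your own tables):

-- Final table, backed by ORC
CREATE TABLE events (col1 STRING, col2 STRING)
PARTITIONED BY (load_date STRING)
STORED AS ORC;

-- Staging clone, backed by plain text so LOAD's file move is harmless
CREATE TABLE events_staging (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- LOAD only moves the file into the staging table's directory
LOAD DATA INPATH '/landing/events/20140218.tsv' INTO TABLE events_staging;

-- The INSERT ... SELECT is what actually runs the ORC SerDe
INSERT INTO TABLE events PARTITION (load_date='2014-02-18')
SELECT col1, col2 FROM events_staging;

-- Clean up the staging copy before the next load
DROP TABLE events_staging;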

Third wave -- Delete? What delete?

Hive is write-forward only. This makes sense since it sits on HDFS, which is write-forward only as well. As a result, deleting one or a few rows requires selecting what should survive into a temp table, dropping the old table, and then renaming the temp table to the original name.

How I was stung: I had to delete a load file's worth of data from the table because something in it could have been wrong. Because we had to seamlessly support analytic work while loading data, the full filter/replace process above wouldn't work.

There are two possible solutions for this. The first is to back the table with HBase, which makes it possible to delete based on a query. If you choose this option you will have to run the delete at the HBase level, which means Hive is not as IT/admin free as it could be. Also, HBase is not as fast as natively backed Hive tables; in our investigation spike it was about 25% slower than Hive's lazy SerDe.
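
For reference, an HBase-backed Hive table is declared with the HBase storage handler; the table, column family and column names below are illustrative:

-- Hive table whose rows live in the HBase table "events"
CREATE TABLE events_hbase (row_key STRING, col1 STRING, col2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:col1,d:col2")
TBLPROPERTIES ("hbase.table.name" = "events");

The row deletes then happen against the underlying "events" table through the HBase shell or API, not through Hive.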

The second option leverages the only native delete-like function in Hive: partitions. Hive partitions are merely an abstraction over how the data is stored. Let's say your table was created as follows.
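
(The example here is based on the classic page_view table from the Hive documentation; the columns are illustrative, but the country partition matches the directories described below.)

CREATE TABLE page_view (
  view_time INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING
)
PARTITIONED BY (dt STRING, country STRING)
STORED AS TEXTFILE;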

There will be an additional set of directories under /page_view, with names like 'country=us'. When you INSERT OVERWRITE into a partition of a PARTITIONED table, you are telling Hive to delete that partition's directory if it exists and load the new data into it.
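
So replacing one day's worth of US data looks roughly like this; page_view_staging is a made-up staging table that holds the corrected rows, with dt and country as ordinary columns:

INSERT OVERWRITE TABLE page_view PARTITION (dt='2014-02-03', country='us')
SELECT view_time, userid, page_url, referrer_url, ip
FROM page_view_staging
WHERE dt = '2014-02-03' AND country = 'us';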

Watch out with partitions. If your partitions are too small, you're going to have a bad time. Remember, you want each partition to reach at least an HDFS block size in order to get the biggest bang for your buck.

Reflecting back on it, I think Hive's biggest issue is its maturity. Missing features, missing abstractions and missing documentation all come with being young. As it stands, it should be a good tool for querying. Until it matures a bit, IT/DevOps will have to play an active role in maintaining the data on a day-to-day basis.

References:

0 - http://www.imdb.com/title/tt0102492/

1 - http://svn.apache.org/viewvc/hive/trunk/jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java?view=markup

2 - https://issues.apache.org/jira/browse/HIVE-2320

3 - http://docs.oracle.com/javase/7/docs/api/java/lang/Process.html

4 - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables

Monday, February 3, 2014

OpenDNS on an ASUS RT-N66R

For anyone who's got this piece of equipment: you might have looked around and found no place to configure an alternative DNS. I searched the Internet and did not get much help.

It turns out the setting is right there, just a little obfuscated. Log in and go to WAN -> WAN DNS Settings. Select "No" as your answer to "Connect to DNS Server automatically". This reveals two new input boxes. Drop in the OpenDNS servers, click Apply, and run an update with the OpenDNS updater.

Not really hard, but it wasn't like other routers where the boxes are always there.