Tuesday, February 18, 2014

Getting stung by the Hive

On my previous project, I got to work with Hive. To quickly introduce the tool, it's an Apache-sponsored SQL façade over Hadoop. Being Java-based, it has a fledgling JDBC driver. Being "scalable", it has a few Thrift services. Being POSIX-based, it has a command-line interface. My task was figuring out how to ETL raw ANSI COBOL files into wide Hive tables. The following post discusses where I was stung and the ointment I found. Unlike Thomas in "My Girl"[0], I survived the Hive and got the system into production.

One note: I was using Hive 0.11, so perhaps 0.12 will bring more to the table.

Warning from the drones

This raised flags with me, but I had to soldier through. Hive has a wiki. It seems to have several, actually; the Confluence one is authoritative, but not necessarily the highest ranked in Google. In my experience the documentation is incomplete at best. There doesn't seem to be a large community around Hive, or at least it's not as easy to find as Cascading's. So that was a glancing sting. If you've got the money, I recommend getting the Hive book. If you don't, or would prefer not to pay for something with YARN staring you in the face, well, hopefully more blog posts like this come along.

First blood -- Interacting with Hive

There are three major ways to talk to Hive. The recommended one is JDBC. For those who enjoy a more raw client/server process, there are Hive Server 1 and Hive Server 2. Finally, there are the command-line interfaces: hive and beeline. Your needs will influence which one, or which combination, of these interaction models you choose.

Presently the SQL understood by Hive's JDBC driver is query-focused. You can name databases in a query, such as "SELECT * FROM MY_DB.SOME_TABLE LIMIT 5;". It is not DDL-focused. You cannot say "ALTER TABLE MY_DB.SOME_TABLE ..."; the ALTER statement does not allow a database qualifier. If you need to work in a specific database and you want to use the JDBC driver, you have to specify the database in the connection URL passed to the driver at construction time.
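
For example, to issue DDL against a hypothetical database named my_db, the database has to be named up front in the connection URL (this sketch assumes HiveServer2 on its default port):

jdbc:hive2://hive-host:10000/my_db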

How I was stung: my system allows the user to override the database name at runtime for DDL statements. You cannot (currently) call setCatalog on the JDBC driver and effect a change[1]. The request for this ability is old, and apparently abandoned[2]. It is also not possible to lean on the "USE <<DATABASE>>" statement in the DDL; the parser will fail. I tried a few ways around this, such as sending "USE <<DATABASE>>" and the DDL together as one statement; it did not work. If you need to execute DDL against a database other than the one specified on the connection URL, JDBC is not presently your friend.

With JDBC a bust for my daily needs, I switched focus to the services. Hive Server 1 is a deprecated first attempt at a service layer; Hive Server 2 is its replacement and supports multiple concurrent clients. Either would have been fine for my purposes.

How I was stung: I could not find any meaningful examples of how to use either server for DDL. When sent a DDL statement such as "CREATE TABLE MY_DATABASE.SOME_TABLE (COL1 STRING)", the server would return a failure message because "STRING" was an unknown type. To be fair, I'd never used a Thrift client before. But in my opinion the Java client should be one of the best-documented versions out there.

My solution was to fall back to the command-line interface from the Java code[3]. The hive command is available on the application account, it allows you to invoke a script using -f, and scripts do support "USE <<DATABASE>>". Putting those together, we can generate a script like the following, to any necessary complexity.

USE ${db};

ALTER TABLE ${table} ${columns};
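
With hypothetical values filled in for ${db}, ${table} and ${columns}, a generated script might look like:

USE my_db;

ALTER TABLE some_table ADD COLUMNS (new_col STRING);

The Java code can then write this out to a temporary file and hand it to hive -f.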

Second wave -- LOAD does not mean what you think it means

Let me start this section by taking full blame for my issues. The documentation clearly says, "Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables."[4] This little section is easy to miss in the heady days when you're first working to get Hive up and loading under a tight deadline.

As it says, Hive JUST MOVES FILES! If you are storing the data as simple text, you're golden; a move happens to work. If you use any other form of SerDe, you're in for a surprise.

How I was stung: LOADing text into an ORC-backed table will corrupt the table. The data will "load" just fine, but when you run a query you'll get runtime conversion errors that prevent you from getting any results.

Getting around this requires double inserting the data. First, create a clone of the target table backed by the basic lazy SerDe. Second, use "LOAD" to move the data into the temp table. Finally, use "INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM temp_table;". The INSERT statement causes the SerDe that backs the final table to convert the rows from the SELECT into its own storage format.
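
A minimal sketch of the double load, assuming a hypothetical target table events_orc that was created with STORED AS ORC:

-- staging clone backed by plain text (the default lazy SerDe)
CREATE TABLE events_staging (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- LOAD only moves the raw file into the staging table's directory
LOAD DATA INPATH '/landing/events.tsv' INTO TABLE events_staging;

-- INSERT ... SELECT forces the ORC SerDe to rewrite the rows into its own format
INSERT INTO TABLE events_orc SELECT * FROM events_staging;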

Now a short note about the performance ramifications of the double load. You will incur at least one mapper job. In my experience, loading an ORC-backed table in this manner results in three job executions per load. Play around with the backing SerDe to balance load cost against query performance.

Here's a GIST

Third wave -- Delete? What delete?

Hive is write-forward only. This makes sense, since it sits on HDFS, which is write-forward only as well. As a result, deleting one or a few rows requires selecting what should survive into a temp table, dropping the old table, and then renaming the temp table to the original name.
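
In HiveQL that dance looks roughly like this, assuming a hypothetical table events from which every row with load_id = 42 must go:

-- copy the survivors (add STORED AS ... to match the original table's format)
CREATE TABLE events_tmp AS SELECT * FROM events WHERE load_id <> 42;

-- swap the copy in for the original
DROP TABLE events;
ALTER TABLE events_tmp RENAME TO events;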

How I was stung: I had to delete a load file's worth of data from the table because something in it could have been wrong. Because we had to seamlessly support analytic work while loading data, the full filter/replace process above wouldn't work.

There are two possible solutions for this. The first is to back the table with HBase, which makes it possible to delete based on a query. If you choose this option, you will have to run the delete at the HBase level, which means Hive is not as IT/admin-free as it could be. Also, HBase-backed tables are not as fast as natively backed Hive tables; in our investigation spike they were about 25% slower than tables using Hive's lazy SerDe.
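
Backing a table with HBase looks roughly like this; a sketch with hypothetical names that assumes the Hive HBase handler jars are on the classpath:

CREATE TABLE events_hbase (id INT, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:payload")
TBLPROPERTIES ("hbase.table.name" = "events");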

The second option leverages the only native delete-like function in Hive: partitions. Hive partitions are merely an abstraction over how the data is stored on disk. Let's say your table was created as follows.
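
For illustration, a minimal sketch (the column names are hypothetical; the PARTITIONED BY clause is the part that matters):

CREATE TABLE page_view (viewTime INT, userid BIGINT, page_url STRING, ip STRING)
PARTITIONED BY (country STRING);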

There will be an additional set of directories under /page_view, such as 'country=us'. When you INSERT OVERWRITE into a partitioned table, you are telling Hive to delete the partition directory if it exists and load the new data into it.
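
Replacing one country's data then becomes a matter of overwriting its partition; a sketch, assuming a hypothetical staging table page_view_staging that holds the corrected rows:

INSERT OVERWRITE TABLE page_view PARTITION (country='us')
SELECT viewTime, userid, page_url, ip
FROM page_view_staging
WHERE country = 'us';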

Watch out with partitions, though. If your partitions are too small, you're going to have a bad time. Remember, you want each partition to hold at least an HDFS block's worth of data in order to get the biggest bang for your buck.

Reflecting back on it, I think Hive's biggest issue is its maturity. Missing features, missing abstractions and missing documentation all come with being young. As it stands, it should be a good tool for querying. Until it matures a bit, IT/DevOps will have to play an active role in maintaining the data on a day-to-day basis.

References:

0 - http://www.imdb.com/title/tt0102492/

1 - http://svn.apache.org/viewvc/hive/trunk/jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java?view=markup

2 - https://issues.apache.org/jira/browse/HIVE-2320

3 - http://docs.oracle.com/javase/7/docs/api/java/lang/Process.html

4 - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
