Unibz 2008/2009 Distributed Databases Final Project FAQ
created: 2009-04-30 12:00
updated: 2009-05-01 15:00
updated: 2009-05-28 16:54

What algorithms do I need to implement?

The article talks about:

Since SAT is just a special case of SM without the merge step (single tree, nothing to merge) and TDM is just TDM+C without the central collection step, those two are not very interesting. You should implement at least two of the remaining three: SM, PM or TDM+C. To get a top mark you should consider implementing all three, however.

What aggregate function do I need to implement?

Implement count(). Optionally, add others such as sum() or avg().

I don't understand how to merge the leaves?

Just think about the last step in merge sort, where two sorted arrays are merged into one (final) array. You can do that in O(n). Don't use nested loops.

I don't understand how to compute the global partition set in TDM-C?

The article does not really give a good, general algorithm. You should just do something simple, like evenly split your total interval. In the general case this is not the most efficient solution, but we start from random data that is equally distributed, so it doesn't really matter.

My benchmark results don't really make any sense. What should I do?

First of all you should be aware that embedded HSQLDB is thread-safe but not multi-threaded. If your benchmark includes the I/O step (get the data from HSQLDB into the worker threads) it will be very much skewed by HSQLDB's locking issues. One solution is to load all the data into an java.util.ArrayList and then start the worker threads that access the list in an unsynchronized way (java.util.ArrayList, like all members of the collection framework, is not thread-safe, but read access won't give you any problems). Idea credits Heinisch' group.
Next, you should note that TDM-C starts to win over PM only if you've got very high parallelism. For a very high thread count PM's merge steps become inefficient: think about 1024 threads: you need 10 iterations of pairwise merges to get the final result, each iteration must wait for the previous to complete. The article mentions that TDM-C starts to win if the core count is more than 32 cores - way more than we've got on our computers.

How do I hand in the final project?
updated

You need to discuss your work with Prof. Gamper and me (chris at 1006 dot org, +39 340 2251141). A formal presentation is not required (no slides, no report), but you should bring your notebook and be prepared to run the code and answer questions about it. All group members must be present. Also, send an archive of your source code (tar.gz or zip) to me. The discussion can be done: