Tuesday, 15 May 2012

Pairs of entities in a NoSQL database

In this blog post we address the problem of many-to-many self references in a relational database and how one can handle it using caching with emphasis on the theoretical and implementational aspects of the program using Java and MySQL.

Let us assume that you need to register some entities in your database (e.g. Person) and pairs of them (e.g. Friendship). 

Respective ER Diagram




Let us see how this can be realized in a relational database. A first approach might be to just create a table with a composite primary key of the form (`person1`,`person2`) where both are foreign keys to a Person. This is not a good approach as duplicates will appear in the database. e.g. If John is a friend of Mike's, there is no guarantee that both pairs ('John','Mike') and the symmetric ('Mike','John') will appear in the database. For that, such pairs need to be considered as equal.

DB Structure
One way to achieve this is by hashes (although, not the only one). If one wants to avoid hashing and all the problems it brings about, they will have to control registration of new entries from within the program that makes the transactions, but this is not the point I want to make here. So, we need a hashing algorithm that maps (A,B) and (B,A) to the same hash key. The first step for this  is to order the pair (A,B). If the person's ID is a string we can use the comparator based on the (case sensitive!) alphabetical order. If it is numeric, it is pretty straight forward to order them. Afterwards we concatenate these two strings into one using some sort of properly chosen separator. In order to figure out what this separator can be, let's take a look at the following example:

Consider the pair (A,B)=('abc','def') and the pair (X,Y)=('ab','cdef'). Ordering and concatenating both of them yields 'abcdef'. This given, we have a clear conflict which can easily blow up our database. If you choose as a separator the character 'd' then the first one will become 'abcddef'. Taking a careful look we see that cannot really resolve the initial tokens... are they ('abcd','ef') or ('abc','def')? Well, you will argue that 'd' was a deliberate counterexample... you would never choose such a separator! For sure a blank space, or a colon would do a much better job! Or a double colon or something more complicated anyway. However, if you want a really good separator you need a character that is not available on any keyboard. In Java such character can be made up using:

String separator = new String(1);

So, up to now your code should look more or less as follows:

 // Your Keys  
 List<String> keys = new ArrayList<String>();  
 //TODO: Populate the list... 
 // Sort the elements of the list  
 Collections.sort(keys, String.CASE_INSENSITIVE_ORDER);
 String separator = new String(1);  
 StringBuilder sb = new StringBuilder();  
   
 for (String s : compoundKeys) {  
  sb.append(s);  
  sb.append(separator);  
 }  

Afterwards you can create a pretty unique hash for your combination of keys using the class MessageDigest in Java:

 MessageDigest md = MessageDigest.getInstance("MD5");  
 md.reset();  
 md.update(sb.toString().getBytes("UTF-8"));  
 byte[] digest = md.digest();  
 BigInteger bigInt = new BigInteger(1, digest);  
 digestAsString = bigInt.toString(16);  
 while (digestAsString.length() < 32) {  
  digestAsString = "0" + digestAsString;  
 }  

MD5 should be fine for most of the purposes, although there are better algorithms. In all cases, when using a hash key you should no that there is always a chance that there would be a conflict (i.e. two different initial objects having the same hash key). What you gain with hashing is that your primary key will be shorter than the initial string.

No comments:

Post a Comment