Amazing Algorithms with NoSQL: A MongoDB Example Part 2

jieforest發表於2012-06-21
In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities. Depending on the target Tanimoto coefficient, the MongoDB solution is able to screen a database of a million compounds in subsecond time.

To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query increases significantly when the target Tanimoto is lowered.

The example code on the GitHub repository for instance, imports and indexes ~25000 chemical compounds. When a target Tanimoto of 0.8 is employed, the query returns ~700 compounds. When the target Tanimoto is lowered to 0.6, the number of returned compounds increases to ~7000.

Using the MongoDB explain functionality, one is able to observe that the internal MongoDB query execution time increases slightly, compared to the execution overhead to transfer the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform. the calculations local to where the data is stored. Welcome to MongoDB’s build-in map-reduce functionality!


1. MongoDB molecular similarity map-reduce query

Map-reduce is a conceptual framework, introduced by Google, to enable the processing of huge datasets using a large number of processing nodes. The general idea is that a larger problem is divided in a set of smaller subproblems that can be answered (i.e. solved) by an individual processing node (the map-step).

Afterwards, the individual solutions are combined again to produce the final answer to the larger problem (the reduce-step). By making sure that the individual map and reduce steps can be computed independently of each other, this divide-and-conquer technique can be easily parallelized on a cluster of processing nodes. Let’s start by refactoring our solution to use MongoDB’s map-reduce functionality.

CODE:

01.te the essential numbers
02.int maxnumberofcompoundfingerprints = (int) (fingerprintsToFind.size() / 0.6);
03.int minnumberofcompoundfingerprints = (int) (fingerprintsToFind.size() * 0.6);
04.int numberoffingerprintstoconsider = fingerprintsToFind.size() - minnumberofcompoundfingerprints;
05.
06.List fingerprintsToConsider = fingerprintsToFind.subList(0,numberoffingerprintstoconsider+1);
07.
08.// Find all compounds that satisfy the specified conditions
09.DBObject compoundquery =
10.QueryBuilder.start(FINGERPRINTS_PROPERTY).in(fingerprintsToConsider)
11..and(FINGERPRINTCOUNT_PROPERTY).lessThanEquals(maxnumberofcompoundfingerprints)
12..and(FINGERPRINTCOUNT_PROPERTY).greaterThanEquals(minnumberofcompoundfingerprints)
13..get();
14.
15.// The map fuction
16.String map = "function() { " +
17."var found = 0; " +
18."var fingerprintslength = this.fingerprints.length; " +
19."for (i = 0; i < fingerprintslength; i++) { " +
20."if (fingerprintstofind[this.fingerprints[i]] === true) { found++; } " +
21."} " +
22."if (found >= minnumberofcompoundfingerprints) { emit (this.compound_cid, {found : found, total : this.fingerprint_count} ); } " +
23."}";
24.
25.// Execute the map reduce function
26.MapReduceCommand mr = new MapReduceCommand(compoundsCollection, map, "", null, MapReduceCommand.OutputType.INLINE, compoundquery);
27.
28.// Create a hashmap for the fingerprints to find (to speed up the javascript. execution)
29.Map tofind = new HashMap();
30.for(String fingerprinttofind : fingerprintsToFind) {
31.tofind.put(fingerprinttofind,true);
32.}
33.
34.// Set the map reduce scope
35.Map scope = new HashMap();
36.scope.put("fingerprintstofind",tofind);
37.scope.put("minnumberofcompoundfingerprints",minnumberofcompoundfingerprints);
38.mr.setScope(scope);
39.
40.// Execute the map reduce
41.MapReduceOutput ut = compoundsCollection.mapReduce(mr);
42.
43.// Iterate the results
44.for (DBObject result : out.results()) {
45.String compound_cid = (String)result.get("_id");
46.DBObject value = (DBObject)result.get("value");
47.
48.// Calculate the tanimoto coefficient
49.double totalcount = (Double)value.get("total");
50.double found = (Double)value.get("found");
51.double tanimoto = (Double)value.get("found") / ((Double)value.get("total") + fingerprintsToFind.size() - (Double)value.get("found"));
52.// We still need to check whether the tanimoto is really >= the required similarity
53.if (tanimoto >= 0.6) {
54.System.out.println(compound_cid + " " + (int)(tanimoto * 100) +"%");
55.}

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/301743/viewspace-733548/,如需轉載,請註明出處,否則將追究法律責任。

相關文章