You are here: Home » About » Technology » Connexions Tech Blog » ...and the changes you can't

...and the changes you can't

2004-11-16
Behind the face of the new look lies a completely re-implemented similarity engine. If you're interested in technical details, and an insight into how these things came about, read on.

You may first want to read my announcement of the new look that took place a couple weeks ago

Why change it?

Our old similarity calculator simply re-used our search engine: it took the keywords of the object in question (or the title, if the object had no keywords) and pretended you had done a search using those words as search terms. It seemed a reasonable heuristic approach, and was definitely easy to implement! The problem was that the new look called for the top three similarity results to be displayed directly on the module, and our search engine ran a bit too slowly for that. When we hooked it up, searching for similar content was a substantial fraction of total module rendering time. Not good.

My first reaction was to speed up searching. After all, our logs showed that it was the single most requested page by a large factor. Ross's suggestion was to cache the similarity results. He ended up being right, but I did convince him to eke a bit more performance out of the search code.

So the plan is simple, right? Just cache the similarity results? Well, not quite so simple....

Symmetric vs. Asymmetric

Pre-computing similarities was a simple task: just run the similarity search at content publication time and store the results, right? Well, not exactly. Doing that would find items similar to that content at the time it was published. In a dynamic, continually evolving repository more similar things might appear later that ought to be linked. How to keep the similarity up to date? If your similarity measure is symmetric ( sim(A,B) == sim(B,A) ), then the answer is simple: when you store the similarity list for B, you also go and insert B into the similarity list for A. But if your similarity measure is asymmetric then you potentially need to recompute similarities for the entire repository every time something is published. Again, not good. And guess which kind of similarity we had. Yup, asymmetric.

So in addition to creating a cache for similarity results, we were now faced with the task of redefining how to compute similarity. We could have looked at the latest research on latent semantic analysis and some up with something cutting-edge, but didn't want to delay the new look and feel unnecessarily. So I suggested we take the cheap-shot mathematical approach to making a measure symmetric: take the average of forward and reverse similarities (or just the sum, since the factor of two doesn't really matter).

The old similarity measure (done via search) compared the object's keywords to the title, keywords, and body text of other objects in the repository. The new measure does that plus compares the object's title, keywords, and body text to the keywords of the other objects in the repository. The best way to visualize it is the matrix that Ross came up with for measuring the similarity of B relative to A: Similarity grid

The numbers indicate the relative score when a match is found in that cell (no number means a score of zero). Clearly the new matrix is symmetric whereas the old one is not. Visualizing similarity score this way emphasizes the fact that our similarity measure could probably stand some improvement. For one thing, we're missing out on 4 potential comparisons. For another, we probably have our scores backwards. Content with matching body text should probably be weighted higher than content with matching titles. Another improvement might be to add more rows/columns, for example the content's abstract.

Theory vs. Practice

OK, so we've redefined similarity, onto the "easy" part. Well, maybe not quite so easy. Our first storage design took advantage of our newfound symmetry by only storing the similarity between two objects once. We used a SQL table with five columns: source id, source version, target id, target version, strength (we have to refer to objects by (id, version) pair because courses, being stored in the ZODB, do not have a single identifier the way modules do. We should no doubt fix this problem in a future version of our repository software). This seemed a good approach, but the structure made for some performance problems. We expected similarity calculations to be slow, which was acceptable since we were precalculating at publication time. We didn't expect similarity retrieval times to be slow, however. The culprit turned out to be our clever table structure: to discover the similarities for a given object you had to scan both the source and the target columns. Since the initial point of this exercise was to speed up similarity retrieval for displaying modules, we decided to try something new.

For a couple months now, we've been taking advantage of PostgreSQL's support for SQL ARRAYs to store roles on modules. We decided to use them again to store similarity information. We even pushed it a little bit and used ARRAYs of ARRAYs: So if m10000 is similar to modules m10001 (version 1.1) and m10002 (version 1.5) we would store:

{ {'m10001','1.1',100}, {'m10002', '1.5', 20}}

Versions: an aside

We actually calculate similarity against all versions of content in the repository to give people the most opportunities to find what they're looking for. We made a slight optimization in storage though. Although we store a similarity row for every version of an object, in the similarity array we only refer to the latest similar version. So even though m10000 also happens to be similar to versions 1.2, 1.3, and 1.4 of m10002, they do not show up in the row for m10000, but m10000 does show up in their rows. So even though our similarity is symmetric, our storage of it is not quite. This optimization matches our use cases: we need to be able to retrieve the similarity data for all versions of objects because they are all viewable online, but we only want to list the most recent similarity links. Most readers would rather see 5 different modules that a page is similar to, not 5 different versions of a single module.

This structure was blindingly fast for reads since it only has to retrieve a single row. Writes, however, were even slower. Primarily this is because now after storing the similarity entry for a new content object, we have to go through all of the rows for objects similar to it and insert the new object. In principle, it's acceptable to have worse performance for writes than reads since there are fewer writes. Unfortunately, it was too slow. We could no longer perform it "inline" during the publish step, but needed to launch a separate process to perform the calculations. There may have been a way to do it using some sort of Zope scheduler like Xron, but we decided just to fork and execute a new zopectl run process. So far it seems to be working. The overhead of starting a new Zope instance is somewhat troubling, but it's not the bottleneck.

The last bit of reality to intrude upon our similarity designs: we noticed that certain similarity calculations end up not being symmetric. We eventually tracked this down to differences in how PostgreSQL and Zope perform searches. We spent some time hacking at the Zope Catalog trying to fix this. Doing some manual punctuation/stop word removal and using Dieter Maurer's ManagableIndex to gain more flexibility helped us fix most of the problems, but it's a little like playing Whack-A-Mole.

Conclusions

The similarity revamp ended up taking longer and being more involved than expected (rather like this blog entry!) but I'm pleased with the way it turned out. It solved the original problem of speeding up module rendering and also provides some avenues for future development. Now that its done I get to work on another cool new piece (and this one has higher user visibility) - the Word importer and Edit-In-Place enhancements.