Tuesday, April 20, 2010

Highlighting query in entire HTML document with Lucene and Solr

I had a not-so-peculiar requirement: highlight the query terms in the complete document. The document is HTML, and Lucene's hit highlighting works nicely on plain text extracted from an HTML/PDF/Word/... file. If Lucene's highlighter is used directly on HTML text, it may produce broken HTML. Think of one of the query terms being "ref": it starts matching inside the anchor tag and adds highlighting markup inside the "a href".
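To make the failure concrete, here is a deliberately naive sketch. It does not use Lucene at all; it just wraps every occurrence of the term with plain string replacement, which is enough to show how markup lands inside the "a href". The class name NaiveHighlightDemo and the method naiveHighlight are made up for this illustration.

```java
public class NaiveHighlightDemo {
    // Wraps every occurrence of term in <b>...</b>, ignoring HTML structure.
    // This is a stand-in for "highlighting raw HTML", not Lucene's algorithm.
    static String naiveHighlight(String html, String term) {
        return html.replace(term, "<b>" + term + "</b>");
    }

    public static void main(String[] args) {
        String html = "<a href=\"page.html\">see the ref</a>";
        // Prints: <a h<b>ref</b>="page.html">see the <b>ref</b></a>
        System.out.println(naiveHighlight(html, "ref"));
    }
}
```

The link text is highlighted as intended, but the attribute name is cut in two and the anchor tag is no longer valid.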

Solr has a neat utility that strips the HTML tags while highlighting. So I passed the HTML through it before handing it to Lucene for highlighting. It works nicely: phrase, proximity, and fuzzy searches are all highlighted appropriately.
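Why this can work at all deserves a quick illustration. The highlighter finds matches in the token stream but inserts its markup into the original HTML string, so the stripped text has to keep the same character offsets as the source. The sketch below is only a toy model of that idea and is not HTMLStripReader's actual code: the class OffsetPreservingStripDemo and the method blankOutTags are made-up names, and the tag detection is deliberately simplistic (it would trip over a '>' inside an attribute value).

```java
public class OffsetPreservingStripDemo {
    // Replaces every character from '<' through the next '>' with a space,
    // so the remaining text keeps its original character offsets.
    static String blankOutTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            out.append(inTag ? ' ' : c);
            if (c == '>') {
                inTag = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>see the <a href=\"x\">ref</a></p>";
        String text = blankOutTags(html);
        // "href" has been blanked out, so the first match is the link text.
        int start = text.indexOf("ref");
        // The same offsets address the same characters in the original HTML.
        System.out.println(html.substring(start, start + 3)); // prints: ref
    }
}
```

Because the blanked-out text and the HTML have the same length, a match found in the stripped text can be wrapped in the original document without touching any tag.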

Here is the code outline. It might be broken, as I have removed some references to internal code. I used Lucene 3.0 (the core jar along with lucene-highlighter-3.0.1.jar and lucene-memory-3.0.1.jar from contrib) and Solr 1.3. As Solr is now part of Lucene, in the future you will need jars only from Lucene. I trust you find this useful.





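import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.solr.analysis.HTMLStripReader;

// FIELD_NAME, analyzer and LOG are members of the enclosing class (not shown).
public static String highlightHTML(String htmlText, Query query) {
    QueryScorer scorer = new QueryScorer(query, FIELD_NAME, FIELD_NAME);
    // The opening and closing tags were lost when this post was rendered;
    // the span and its "highlight" class are placeholders, use your own markup.
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<span class=\"highlight\">", "</span>");
    Highlighter highlighter = new Highlighter(htmlFormatter, scorer);

    // NullFragmenter for highlighting the entire document as one fragment.
    highlighter.setTextFragmenter(new NullFragmenter());

    // Tokenize the tag-stripped text, but highlight against the original HTML.
    TokenStream ts = analyzer.tokenStream(FIELD_NAME, new HTMLStripReader(new StringReader(htmlText)));

    try {
        String highlightedText = highlighter.getBestFragment(ts, htmlText);
        if (highlightedText != null) {
            return highlightedText;
        }
    } catch (Exception e) {
        LOG.error("Failed to highlight query string " + query, e);
    }
    // Nothing matched or highlighting failed: return the HTML unchanged.
    return htmlText;
}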


Update: The code given above broke the highlighting for certain HTML documents, namely those generated by Microsoft Office (Word, Outlook, etc.). These documents were large and heavily formatted, and the formatting was done with CSS placed in the head tag of the HTML document. HTMLStripReader looks ahead to decide how to handle the current tag. For example, when it encounters an opening style tag, it looks for the matching closing style tag and then removes the entire block, since it is not indexable text. This lookahead is limited to 8K characters by default. That is, if it finds the closing style tag within 8K characters of the opening style tag, all is well. Otherwise it treats the CSS code inside the style tag as ordinary HTML, and things go downhill from there.
The fix is fairly straightforward: increase the lookahead limit. For example, to raise the limit to 64K, change the HTMLStripReader construction as follows (this also needs an import of java.util.Collections).
    TokenStream ts = analyzer.tokenStream(FIELD_NAME, new HTMLStripReader(new StringReader(htmlText), Collections.<String>emptySet(), 65536));
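For anyone who wants to see the effect of the limit in isolation, below is a small self-contained model of a bounded lookahead. It is an assumption-laden sketch and not HTMLStripReader's implementation: ReadAheadLimitDemo, stripStyle and sampleDocument are hypothetical names, and the model handles only a single bare style tag. It removes the style block only when the closing tag shows up within the given number of characters.

```java
public class ReadAheadLimitDemo {
    static final String OPEN = "<style>";
    static final String CLOSE = "</style>";

    // Removes the <style>...</style> block only when the closing tag lies
    // entirely within readAheadLimit characters after the opening tag.
    // Otherwise only the opening tag is dropped and the CSS leaks through.
    static String stripStyle(String html, int readAheadLimit) {
        int start = html.indexOf(OPEN);
        if (start < 0) {
            return html;
        }
        int bodyStart = start + OPEN.length();
        int windowEnd = Math.min(html.length(), bodyStart + readAheadLimit);
        int end = html.substring(bodyStart, windowEnd).indexOf(CLOSE);
        if (end < 0) {
            // Closing tag not seen within the limit: give up on the block.
            return html.substring(0, start) + html.substring(bodyStart);
        }
        return html.substring(0, start)
                + html.substring(bodyStart + end + CLOSE.length());
    }

    // Builds a document whose style block holds 10,200 characters of CSS,
    // which is more than 8K but well under 64K.
    static String sampleDocument() {
        StringBuilder css = new StringBuilder();
        for (int i = 0; i < 600; i++) {
            css.append("p { margin: 0; } ");
        }
        return OPEN + css + CLOSE + "<p>hello</p>";
    }

    public static void main(String[] args) {
        String html = sampleDocument();
        // Limit of 64K: the whole block is removed. Prints: <p>hello</p>
        System.out.println(stripStyle(html, 65536));
        // Limit of 8K: the CSS ends up in the text. Prints: true
        System.out.println(stripStyle(html, 8192).contains("margin"));
    }
}
```

With the smaller limit the CSS rules become part of the text that gets tokenized and highlighted, which matches the kind of breakage seen with the Office-generated documents.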

