Tuesday, April 20, 2010

Highlighting query in entire HTML document with Lucene and Solr

I had a not-so-peculiar requirement of highlighting the query terms in a complete document. The document is HTML, and Lucene's hit highlighting works nicely on plain text extracted from an HTML/PDF/Word/... file. If Lucene's highlighter is used directly on HTML text, it may produce broken HTML. Think of one of the query terms being "ref": it starts matching inside the anchor tag and adds highlighting markup right in the "a href".

Solr has a neat utility which strips off the HTML tags while highlighting. So I passed the HTML through that before sending it to Lucene for highlighting. It works nicely. Phrase, proximity and fuzzy searches are all highlighted appropriately.

Here is the code outline. Sorry, I didn't take the trouble of pretty formatting. The code might be broken too, as I have removed some specific references to internal code. I have used Lucene 3.0 (the core jar along with lucene-highlighter-3.0.1.jar and lucene-memory-3.0.1.jar from contrib) and Solr 1.3. As Solr is now part of Lucene, in future you will need jars only from Lucene. I trust you find this useful.





import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.solr.analysis.HTMLStripReader;

// FIELD_NAME, analyzer and LOG are defined elsewhere (internal references removed).
public static String highlightHTML(String htmlText, Query query) {
    QueryScorer scorer = new QueryScorer(query, FIELD_NAME, FIELD_NAME);
    // The original pre/post tags were eaten by the blog's HTML rendering; any markup works here.
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<span class=\"highlight\">", "</span>");
    Highlighter highlighter = new Highlighter(htmlFormatter, scorer);

    // NullFragmenter for highlighting the entire document instead of snippets.
    highlighter.setTextFragmenter(new NullFragmenter());

    // Solr's HTMLStripReader strips the tags so that tokens never match inside markup.
    TokenStream ts = analyzer.tokenStream(FIELD_NAME, new HTMLStripReader(new StringReader(htmlText)));

    try {
        String highlightedText = highlighter.getBestFragment(ts, htmlText);
        if (highlightedText != null) {
            return highlightedText;
        }
    } catch (Exception e) {
        LOG.error("Failed to highlight query string " + query, e);
    }
    return htmlText;
}
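
For reference, here is a rough usage sketch. It assumes StandardAnalyzer and a made-up query string; use whatever analyzer your index actually uses, and note that QueryParser.parse() throws ParseException.

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    QueryParser parser = new QueryParser(Version.LUCENE_30, FIELD_NAME, analyzer);
    Query query = parser.parse("\"hit highlighting\" OR ref");  // throws ParseException
    String highlighted = highlightHTML(htmlText, query);        // htmlText is the raw HTML document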


Update: The code given above failed for certain HTML documents; the highlighting was broken. These documents were HTML generated by Microsoft Office (Word, Outlook, etc.). They were large and had a lot of formatting, done with CSS sitting in the head tag of the document. The HTMLStripReader looks ahead to decide what to do with the current tag. For example, when it encounters an opening style tag, it looks for the matching closing style tag and then removes the entire block, as it is not indexable text of the document. This lookahead limit is 8K by default. That is, if it finds the closing style tag within 8K characters of the opening style tag, it is happy. Otherwise it treats the CSS code inside the style tag as HTML content, and things go downhill from there.
The fix is fairly straightforward: increasing the lookahead limit solves it. For example, to increase the limit to 64K, change the HTMLStripReader construction as follows.
    TokenStream ts = analyzer.tokenStream(FIELD_NAME, new HTMLStripReader(new StringReader(htmlText), Collections.EMPTY_SET, 65536));




Wednesday, October 14, 2009

Hadoop Error : "ls: Cannot access .: No such file or directory."

When the server was switched and I started a fresh Hadoop (pseudo-distributed) cluster on the new machine, I was greeted with the following error.

"ls: Cannot access .: No such file or directory."

In the log, I found an exception

"org.apache.hadoop.ipc.RemoteException: java.io.IOException: File could only be replicated to 0 nodes, instead of 1"

I went on a wild goose chase to see if the cause was multiple NICs, etc. But no banana.

Since I was using the default configuration, the data directory was on /tmp. There was a possibility that some restrictions had been placed on /tmp. I changed the data location to my home area and recreated the cluster. The logs didn't show any exceptions, but the "ls" command still refused to show anything meaningful.

After browsing what seemed like a million pages, I landed on this page, where Philip had written this:
"hadoop dfs -ls" defaults to /user/<username>, so if that directory doesn't exist, it'll show you that error. Try "hadoop dfs -ls /".
Voila. That was true. "hadoop fs -ls /" showed me "/tmp" as the only location. I created the /user/<username> directory, and now "hadoop fs -ls" started behaving correctly.
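
If you prefer to do the check programmatically, here is a minimal sketch using Hadoop's Java FileSystem API; the class name is made up, and the configuration is assumed to be picked up from your cluster's config files on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class EnsureHomeDir {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads the Hadoop config from the classpath
            FileSystem fs = FileSystem.get(conf);
            Path home = fs.getHomeDirectory();         // typically /user/<username>
            if (!fs.exists(home)) {
                fs.mkdirs(home);                       // create it so "hadoop fs -ls" has something to list
                System.out.println("Created " + home);
            } else {
                System.out.println(home + " already exists");
            }
        }
    }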

So, if you get this error, try this out before you dig into any exceptions. It could save you a few hours.



Wednesday, December 03, 2008

Firefox making additional request for feed on a page

We recently (a few hours back, to be precise) enabled RSS feeds for our service. For each search result page we provide a link to the RSS feed with HTML's link tag in the head, as follows.

<link rel="alternate" type="application/rss+xml" href="/feed/[FEED-URL]" />

While accessing these pages, I also watched the web server's access log. Mysteriously, I found requests to the feed URLs in the log.

This was observed with Firefox. To confirm that the problem was not in my code, I used the Live HTTP Headers Firefox add-on to see the requests being sent. Firefox was indeed requesting the feed URL. And if the page had multiple RSS feed links (like GigaOm), Firefox fetched all of them.

I accessed the same page with IE. The requests to feed URLs just weren't there. Even Chrome was not sending any requests for feed URLs.

The only justification I can see for this behaviour is that Firefox checks whether the RSS icon for the page needs to be shown in the address bar. But I fail to understand why Firefox just doesn't trust the publisher.

But this is wrong on so many counts. First, it adds latency for the user. Second, it wastes bandwidth - the user's and the publisher's. Third, it adds unnecessary load on the server.

In a nutshell, this is a serious bug.



Thursday, October 16, 2008

Javamail : Using SMTP server with SSL (Secure)

One of the major drawbacks of EC2 is that its static IPs come with a history. Most services treat these IPs as dynamic IPs. The EC2 infrastructure makes it very easy for somebody to set up a server to send spam. Of course, Amazon is quite vigilant about this and takes it very seriously.

So the IP address you are using is most probably blacklisted. What do you do? Use an external SMTP server.

Coming to the topic of the post: sending mail via SMTP over SSL in Java. JavaMail looks to be the widely used solution. I tried many examples from the net, but couldn't get it right. Everybody claimed "it should just work", and it just didn't. After going through the documentation and the sample program, I managed to do it. The documentation says the code has a bug. (If there is one, why can't you just fix it?) Here is the basic snippet to do this.



import java.util.Date;
import java.util.Properties;

import javax.mail.Message;
import javax.mail.Session;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

import com.sun.mail.smtp.SMTPTransport;

// SMTP_HOST_NAME, SMTP_PORT, SMTP_AUTH_USER, SMTP_AUTH_PWD, FROM_ADDRESS and TO_ADDRESS
// are your own settings.
Properties props = new Properties();
props.put("mail.smtps.host", SMTP_HOST_NAME);
props.put("mail.smtps.port", Integer.toString(SMTP_PORT));
props.put("mail.smtps.auth", "true");

// Secure session
Session ssession = Session.getInstance(props, null);
ssession.setDebug(true);

Message msg = new MimeMessage(ssession);
msg.setFrom(new InternetAddress(FROM_ADDRESS));
InternetAddress[] address = { new InternetAddress(TO_ADDRESS, false) };
msg.setRecipients(Message.RecipientType.TO, address);
msg.setSubject("Test E-Mail through Java " + new Date());
msg.setSentDate(new Date());
// Set message content
msg.setText("This is a test of sending a text e-mail through Java.\n");

// Send the message over the "smtps" protocol with an explicit connect and sendMessage,
// instead of the usual Transport.send(msg).
SMTPTransport t = (SMTPTransport) ssession.getTransport("smtps");
t.connect(SMTP_HOST_NAME, SMTP_AUTH_USER, SMTP_AUTH_PWD);
t.sendMessage(msg, msg.getAllRecipients()); // !!! Note this.
t.close();



Google now understands forums.

Try this Google search: "javamail smtp nabble". In the results you can see the number of posts in a thread and the date of the last post.

To say this is useful is an understatement.



Monday, September 29, 2008

Yahoo's home page not optimized for performance.

Well, that's not my verdict. It is the report card given by, err, Yahoo. See here.



Friday, September 26, 2008

S3: Should you use a subdomain or a directory for your bucket?

When you create a bucket on S3 to host media content for public access, there are two ways you can form the URL.

http://bucket-name.s3.amazonaws.com/

Or

http://s3.amazonaws.com/bucket-name/

Sometimes people name their buckets after their domain name, which just looks good. Otherwise it is the same as the first or second form.

Here is my unscientific observation, made via wget: the sub-domain method is slower.

Here is why.

If the bucket-name.s3.amazonaws.com domain is not extremely popular, name resolution takes some time. The parent domain s3.amazonaws.com is a few orders of magnitude (think thousands of times) more popular than the ones with a bucket name prefixed to it, so its resolution happens very quickly.

Of course, once the domain has been resolved, subsequent requests should be faster thanks to DNS caching. But the problem will arise again when the DNS cache is flushed.
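
If you want to repeat the comparison yourself, here is a rough Java sketch that simply times the name lookups. The bucket name is a made-up placeholder, and this measures only DNS resolution (which the JVM also caches after the first call), not the full request.

    import java.net.InetAddress;

    public class DnsTiming {
        public static void main(String[] args) throws Exception {
            String[] hosts = { "s3.amazonaws.com", "my-example-bucket.s3.amazonaws.com" };
            for (String host : hosts) {
                long start = System.nanoTime();
                InetAddress.getByName(host);  // forces a DNS lookup
                long millis = (System.nanoTime() - start) / 1000000;
                System.out.println(host + " resolved in " + millis + " ms");
            }
        }
    }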

Tell me how wrong I am on this.



Wednesday, September 24, 2008

JSP page buffer size fix for slow page generation/loading

If a JSP page is taking a lot of time to load, here is one possible reason. The default buffer size for the JSP page is 32k. When this buffer gets full, its contents are flushed out over the network. Sometimes it is desirable to flush early, as it doesn't leave the user wondering what's happening. But what if you were just about to finish the page? That is what was happening with us, so we bumped up the buffer size with the page directive below.

<%@page buffer="64kb" %>

When we pushed up the buffer size, a delay of around 800 ms just vanished.


