Sunil Jagadish

Archive for April 2007

Smart Device Apps not smart yet on Vista


After hunting around for a way to get internet connectivity on my WM 5.0 PPC Emulator, I came across an informative thread on MSDN Forums. The two suggestions given there didn’t work for me. Unfortunately the PPC Emulator cannot be cradled in Vista’s Mobile Device Center yet, which is sad. When is a fix coming for this? I hope I won’t have to wait for Orcas. [Ref: VSD team blog]


Written by Sunil

2007.04.28 at 06:36 PM

Posted in Uncategorized

Indexing data using Plucene


If you are impressed with what Doug Cutting’s Lucene can do for your data to enable super-fast searching, then you will be happy to know about Plucene (a Perl port of Lucene). Of late I have been playing around with Plucene, indexing GBs of data, which I will have to query in a tight loop later.

So, why Lucene? Lucene stores data in the form of an inverted index (a map from each term to the documents that contain it), which makes retrieval significantly faster than a normal indexing scheme that maps each document to its contents.

Here is a very simple example scenario in which we have 3 documents containing different names. This is how a normal indexing scheme and an inverted index would store this data:

Normal Index
Doc1 – Bill Gates, Linus Torvalds, Richard Stallman
Doc2 – Steve Jobs, Scott Mc Nealy, Linus Torvalds, Bill Gates
Doc3 – Bill Gates, Steve Jobs, Larry Ellison, Scott Mc Nealy

Inverted Index
Bill Gates – Doc1, Doc2, Doc3
Linus Torvalds – Doc1, Doc2
Richard Stallman – Doc1
Steve Jobs – Doc2, Doc3
Scott Mc Nealy – Doc2, Doc3
Larry Ellison – Doc3

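It is quite clear that a search for “Bill” on the inverted index will right away return the documents in which the term appears, whereas in the case of a normal index we’ll have to go through all the documents and check whether the search term exists. (With the SimpleAnalyzer used in the Plucene code in this post, each name is split into lowercase terms such as bill and gates, so the lookup is on a single term.)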
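To make the difference concrete, here is a minimal sketch in plain Perl (no Plucene involved) that builds an inverted index for the three example documents and compares a lookup against a full scan. The helper names tokenize, build_inverted_index, search_inverted and search_by_scanning are made up for this illustration, and the in-memory hash is only a stand-in; it is not how Plucene lays out its index on disk.

```perl
use strict;
use warnings;

# The three example documents from the text (the "normal" view).
my %docs = (
    Doc1 => [ 'Bill Gates', 'Linus Torvalds', 'Richard Stallman' ],
    Doc2 => [ 'Steve Jobs', 'Scott Mc Nealy', 'Linus Torvalds', 'Bill Gates' ],
    Doc3 => [ 'Bill Gates', 'Steve Jobs', 'Larry Ellison', 'Scott Mc Nealy' ],
);

# Hypothetical helper: lowercase a string and split it into runs of letters.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /[^a-z]+/, lc $text;
}

# Build the inverted index: term => { document name => 1 }.
sub build_inverted_index {
    my ($docs) = @_;
    my %index;
    for my $name (keys %$docs) {
        for my $entry (@{ $docs->{$name} }) {
            $index{$_}{$name} = 1 for tokenize($entry);
        }
    }
    return \%index;
}

# Lookup against the inverted index: a single hash access per term.
sub search_inverted {
    my ($index, $term) = @_;
    my ($key) = tokenize($term);
    return () unless defined $key;
    return sort keys %{ $index->{$key} || {} };
}

# Lookup without an inverted index: every document has to be scanned.
sub search_by_scanning {
    my ($docs, $term) = @_;
    my ($key) = tokenize($term);
    return () unless defined $key;
    my @hits;
    for my $name (sort keys %$docs) {
        my @tokens = map { tokenize($_) } @{ $docs->{$name} };
        push @hits, $name if grep { $_ eq $key } @tokens;
    }
    return @hits;
}

my $index = build_inverted_index(\%docs);
print join(', ', search_inverted($index, 'Bill')), "\n";      # Doc1, Doc2, Doc3
print join(', ', search_inverted($index, 'Stallman')), "\n";  # Doc1
```

Both search functions return the same documents; the difference is that search_inverted touches one hash entry, while search_by_scanning has to walk through every document on each query.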

# Index a text file with Plucene, one document per line
use strict;
use warnings;
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::SimpleAnalyzer;

# Use the simple analyzer to tokenize the input in the default way
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();

# Create an Index::Writer object which will write into the index
# (the third argument, 1, asks for a new index to be created)
my $writer = Plucene::Index::Writer->new("/usr/local/my_index", $analyzer, 1);

# Path of the file to index (taken from the command line here)
my $DocToIndex = shift @ARGV;
open(my $fh, '<', $DocToIndex) or die "Cannot open $DocToIndex: $!";

# Read the file one line at a time
while (my $line = <$fh>) {
    chomp $line;
    # Generation of the ID is not shown here; the line number is only a stand-in
    my $id = $.;

    # Create a new Document object, which will contain the fields & corresponding values
    my $doc = Plucene::Document->new;
    # Create a keyword field and store a unique ID in it
    $doc->add(Plucene::Document::Field->Keyword(id => $id));
    # Create a text field and store the line of the file in it
    $doc->add(Plucene::Document::Field->Text(text => $line));

    # Add the document to the present index
    $writer->add_document($doc);
}
close($fh);

# Merge multiple segment files created while indexing
$writer->optimize;
undef $writer;

Plucene will now have created many files in /usr/local/my_index, which together form the index. The purpose of the contents of each file is described in

Next, let’s see how we can query this index.

use Plucene::QueryParser;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Search::IndexSearcher;
use Plucene::Search::HitCollector;

my $parser = Plucene::QueryParser->new({
    analyzer => Plucene::Analysis::SimpleAnalyzer->new(),
    default  => "text"   # default field for queries that do not name one
});

# Prepare the query - search for Bill in the text field
my $query = $parser->parse('text:"Bill"');

# Which index to search?
my $searcher = Plucene::Search::IndexSearcher->new("/usr/local/my_index");

my @docs;
# A callback which is called every time a search hit is found.
my $hc = Plucene::Search::HitCollector->new(collect => sub {
    my ($self, $doc, $score) = @_;
    push @docs, $searcher->doc($doc);
});

# Search!
$searcher->search_hc($query, $hc);

# @docs contains Document objects, so extract only the IDs by mapping it to a @results array
my @results = map {
    $_->get("id")->string
} @docs;

# Print the ID of the documents which contained the search term
foreach my $id (@results) {
    print "\nRes: ", $id;
}

This is a very simple use of Plucene. However, it scales well, and many more complex applications can be built on top of it.

Written by Sunil

2007.04.25 at 03:05 PM

Posted in Uncategorized