Sunil Jagadish

Indexing data using Plucene

with one comment

If you are impressed with what Doug Cutting’s Lucene can do to your data, to enable super-fast searching, then, you can will be happy to know the existence of Plucene (a Perl port of Lucene). Off late I have been playing around with Plucene, indexing GBs of data, which I will have to query in a tight loop later.

So, why Lucene? Lucene stores data in the form of an inverted index which makes retrieval significantly faster compared to a normal indexing scheme.

A very simple example scenario where we have 3 documents containing different words. This is how a normal indexing scheme and an inverted index would index this data-

Normal Index
Doc1 – Bill Gates, Linus Torvalds, Richard Stallman
Doc2 – Steve Jobs, Scott Mc Nealy, Linus Torvalds, Bill Gates
Doc3 – Bill Gates, Steve Jobs, Larry Ellison, Scott Mc Nealy

Inverted Index
Bill Gates – Doc1, Doc2, Doc3
Linus Torvalds – Doc1, Doc2
Richard Stallman – Doc1
Steve Jobs – Doc2, Doc3
Scott Mc Nealy – Doc2, Doc3
Larry Ellison – Doc3

It is quite clear that a search for “Bill” on the inverted index will right-away return the documents in which the term appears. Whereas, in the case of a normal index, we’ll have to go through all the documents and check if the search term exists.

use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Analysis::SimpleAnalyzer;

# Use the simple analyzer to tokenize the input in the default way
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();

# Create an object of Index::Writer which will write into the index
$writer = Plucene::Index::Writer->new(“/usr/local/my_index”, $analyzer, 1);

my $DocToIndex;
open(DOC, $DocToIndex);

# Create a new Document object, which will contain the fields & corresponding values
my $doc = Plucene::Document->new;

while(<DOC>) {
# Read from the file
my $line = <DOC>;
# Create a text field and store a unique ID in it. Generation of ID not shown here.
$doc->add(Plucene::Document::Field->Keyword(“id” => $id));
# Create a text field and store each line of the file in it
$doc->add(Plucene::Document::Field->text(“text” => $line));
}

# Add the document to the present index
$writer->add_document($doc);

# Merge multiple segment files created while indexing
$writer->optimize;
undef $writer;

Plucene would have now created many files in /usr/local/my_index which together forms the index. The purpose of contents of each file is described in http://lucene.apache.org/java/docs/fileformats.html

Next, let’s see how we can query this index.

my $parser = Plucene::QueryParser->new({
analyzer => Plucene::Analysis::SimpleAnalyzer->new(),
default => “text”
});

# Prepare the query – search for Bill in the text field
my $query = $parser->parse(‘text:”Bill”‘);

# Which index to search?
my $searcher = Plucene::Search::IndexSearcher->new(“/usr/local/my_index”);

my @docs;
# A callback which is called every time a search hit is found.
my $hc = Plucene::Search::HitCollector->new(collect => sub {
my ($self, $doc, $score) = @_;
push @docs, $searcher->doc($doc);
});

# Search!
$searcher->search_hc($query, $hc);

# @docs contains Document objects, so, extract only the IDs from it by mapping it to a @results array
my @results = map {
$_->get(“id”)->string;
} @docs;

# Print the ID of the documents which contained the search term
foreach my $id(@results) {
print “\nRes: “, $id;
}

This is a very simple implementation of Plucene. However it is highly scalable and many more complex applications can be built to make use of Plucene.

Advertisements

Written by Sunil

2007.04.25 at 03:05 PM

Posted in Uncategorized

One Response

Subscribe to comments with RSS.

  1. good one.. But hardly i could understandur perl code!

    Ranganath.S

    2007.05.08 at 10:48 AM


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: