jump to navigation

Anagrams July 18, 2007

Posted by suniljagadish in Algorithms, Perl.
3 comments

What is an anagram?

Problem: Given a file containing words, re-arrange it such a way that all anagrams appear consecutively.

Example:
Given,
dcba
pqrs
zyx
cbad
xyz
sqrp
abcd

The result should be:
abcd
cbad
dcba
pqrs
sqrp
xyz
zyx

Here is a simple Perl + unix command line which does the job:

#!/usr/local/bin/perl -w

use strict;

open(INP,$ARGV[0]);

while(<INP>) {
   chomp;
   my $sig = join ”, sort split(”,$_);
   print “$sig\t$_\n”;
}

Command line: ./anagram.pl ip.txt | sort | awk -F’\t’ ‘{print $2;}’

Technique: Every word is associated with a signature. Here, a signature is nothing but the word with all the characters in it appearing in sorted order.

Here is how it looks:

abcd    abcd
abcd    cbad
abcd    dcba
pqrs    pqrs
pqrs    sqrp
xyz     xyz
xyz     zyx

Then, awk -F’\t’ ‘{print $2;}’ simply prints the original words (column 2).

Reading from a pipe in Perl April 25, 2007

Posted by suniljagadish in Perl.
add a comment

Many a times I have come across scenarios where I have to read from a pipe in my Perl script. It is scenarios where I zcat from a file into my program -

$ zcat myfile.gz | ./myscript.pl > dump.txt

This can be done pretty easily:

while(<>) {
# $_ is a special variable that stores the current line read
# Here, my field seperator is ^A
@fields = split(/001/,$_);

$_ = <>; # Read the next line, for whatever reason
if($_) {
print $_;
}

# Do anything else here
}

Indexing data using Plucene April 25, 2007

Posted by suniljagadish in Perl.
1 comment so far

If you are impressed with what Doug Cutting’s Lucene can do to your data, to enable super-fast searching, then, you can will be happy to know the existence of Plucene (a Perl port of Lucene). Off late I have been playing around with Plucene, indexing GBs of data, which I will have to query in a tight loop later.

So, why Lucene? Lucene stores data in the form of an inverted index which makes retrieval significantly faster compared to a normal indexing scheme.

A very simple example scenario where we have 3 documents containing different words. This is how a normal indexing scheme and an inverted index would index this data-

Normal Index
Doc1 – Bill Gates, Linus Torvalds, Richard Stallman
Doc2 – Steve Jobs, Scott Mc Nealy, Linus Torvalds, Bill Gates
Doc3 – Bill Gates, Steve Jobs, Larry Ellison, Scott Mc Nealy

Inverted Index
Bill Gates – Doc1, Doc2, Doc3
Linus Torvalds – Doc1, Doc2
Richard Stallman – Doc1
Steve Jobs – Doc2, Doc3
Scott Mc Nealy – Doc2, Doc3
Larry Ellison – Doc3

It is quite clear that a search for “Bill” on the inverted index will right-away return the documents in which the term appears. Whereas, in the case of a normal index, we’ll have to go through all the documents and check if the search term exists.

use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Analysis::SimpleAnalyzer;

# Use the simple analyzer to tokenize the input in the default way
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();

# Create an object of Index::Writer which will write into the index
$writer = Plucene::Index::Writer->new(“/usr/local/my_index”, $analyzer, 1);

my $DocToIndex;
open(DOC, $DocToIndex);

# Create a new Document object, which will contain the fields & corresponding values
my $doc = Plucene::Document->new;

while(<DOC>) {
# Read from the file
my $line = <DOC>;
# Create a text field and store a unique ID in it. Generation of ID not shown here.
$doc->add(Plucene::Document::Field->Keyword(“id” => $id));
# Create a text field and store each line of the file in it
$doc->add(Plucene::Document::Field->text(“text” => $line));
}

# Add the document to the present index
$writer->add_document($doc);

# Merge multiple segment files created while indexing
$writer->optimize;
undef $writer;

Plucene would have now created many files in /usr/local/my_index which together forms the index. The purpose of contents of each file is described in http://lucene.apache.org/java/docs/fileformats.html

Next, let’s see how we can query this index.

my $parser = Plucene::QueryParser->new({
analyzer => Plucene::Analysis::SimpleAnalyzer->new(),
default => “text”
});

# Prepare the query – search for Bill in the text field
my $query = $parser->parse(‘text:”Bill”‘);

# Which index to search?
my $searcher = Plucene::Search::IndexSearcher->new(“/usr/local/my_index”);

my @docs;
# A callback which is called every time a search hit is found.
my $hc = Plucene::Search::HitCollector->new(collect => sub {
my ($self, $doc, $score) = @_;
push @docs, $searcher->doc($doc);
});

# Search!
$searcher->search_hc($query, $hc);

# @docs contains Document objects, so, extract only the IDs from it by mapping it to a @results array
my @results = map {
$_->get(“id”)->string;
} @docs;

# Print the ID of the documents which contained the search term
foreach my $id(@results) {
print “\nRes: “, $id;
}

This is a very simple implementation of Plucene. However it is highly scalable and many more complex applications can be built to make use of Plucene.