Anagrams July 18, 2007
Posted by suniljagadish in Algorithms, Perl.3 comments
Problem: Given a file containing words, re-arrange it such a way that all anagrams appear consecutively.
Example:
Given,
dcba
pqrs
zyx
cbad
xyz
sqrp
abcd
The result should be:
abcd
cbad
dcba
pqrs
sqrp
xyz
zyx
Here is a simple Perl + unix command line which does the job:
#!/usr/local/bin/perl -w
use strict;
open(INP,$ARGV[0]);
while(<INP>) {
chomp;
my $sig = join ”, sort split(”,$_);
print “$sig\t$_\n”;
}
Command line: ./anagram.pl ip.txt | sort | awk -F’\t’ ‘{print $2;}’
Technique: Every word is associated with a signature. Here, a signature is nothing but the word with all the characters in it appearing in sorted order.
Here is how it looks:
abcd abcd
abcd cbad
abcd dcba
pqrs pqrs
pqrs sqrp
xyz xyz
xyz zyx
Then, awk -F’\t’ ‘{print $2;}’ simply prints the original words (column 2).
Reading from a pipe in Perl April 25, 2007
Posted by suniljagadish in Perl.add a comment
Many a times I have come across scenarios where I have to read from a pipe in my Perl script. It is scenarios where I zcat from a file into my program -
$ zcat myfile.gz | ./myscript.pl > dump.txt
This can be done pretty easily:
while(<>) {
# $_ is a special variable that stores the current line read
# Here, my field seperator is ^A
@fields = split(/001/,$_);$_ = <>; # Read the next line, for whatever reason
if($_) {
print $_;
}# Do anything else here
}
Indexing data using Plucene April 25, 2007
Posted by suniljagadish in Perl.1 comment so far
If you are impressed with what Doug Cutting’s Lucene can do to your data, to enable super-fast searching, then, you can will be happy to know the existence of Plucene (a Perl port of Lucene). Off late I have been playing around with Plucene, indexing GBs of data, which I will have to query in a tight loop later.
So, why Lucene? Lucene stores data in the form of an inverted index which makes retrieval significantly faster compared to a normal indexing scheme.
A very simple example scenario where we have 3 documents containing different words. This is how a normal indexing scheme and an inverted index would index this data-
Normal Index
Doc1 – Bill Gates, Linus Torvalds, Richard Stallman
Doc2 – Steve Jobs, Scott Mc Nealy, Linus Torvalds, Bill Gates
Doc3 – Bill Gates, Steve Jobs, Larry Ellison, Scott Mc Nealy
Inverted Index
Bill Gates – Doc1, Doc2, Doc3
Linus Torvalds – Doc1, Doc2
Richard Stallman – Doc1
Steve Jobs – Doc2, Doc3
Scott Mc Nealy – Doc2, Doc3
Larry Ellison – Doc3
It is quite clear that a search for “Bill” on the inverted index will right-away return the documents in which the term appears. Whereas, in the case of a normal index, we’ll have to go through all the documents and check if the search term exists.
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Analysis::SimpleAnalyzer;# Use the simple analyzer to tokenize the input in the default way
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();# Create an object of Index::Writer which will write into the index
$writer = Plucene::Index::Writer->new(“/usr/local/my_index”, $analyzer, 1);my $DocToIndex;
open(DOC, $DocToIndex);# Create a new Document object, which will contain the fields & corresponding values
my $doc = Plucene::Document->new;while(<DOC>) {
# Read from the file
my $line = <DOC>;
# Create a text field and store a unique ID in it. Generation of ID not shown here.
$doc->add(Plucene::Document::Field->Keyword(“id” => $id));
# Create a text field and store each line of the file in it
$doc->add(Plucene::Document::Field->text(“text” => $line));
}# Add the document to the present index
$writer->add_document($doc);# Merge multiple segment files created while indexing
$writer->optimize;
undef $writer;
Plucene would have now created many files in /usr/local/my_index which together forms the index. The purpose of contents of each file is described in http://lucene.apache.org/java/docs/fileformats.html
Next, let’s see how we can query this index.
my $parser = Plucene::QueryParser->new({
analyzer => Plucene::Analysis::SimpleAnalyzer->new(),
default => “text”
});# Prepare the query – search for Bill in the text field
my $query = $parser->parse(‘text:”Bill”‘);# Which index to search?
my $searcher = Plucene::Search::IndexSearcher->new(“/usr/local/my_index”);my @docs;
# A callback which is called every time a search hit is found.
my $hc = Plucene::Search::HitCollector->new(collect => sub {
my ($self, $doc, $score) = @_;
push @docs, $searcher->doc($doc);
});# Search!
$searcher->search_hc($query, $hc);# @docs contains Document objects, so, extract only the IDs from it by mapping it to a @results array
my @results = map {
$_->get(“id”)->string;
} @docs;# Print the ID of the documents which contained the search term
foreach my $id(@results) {
print “\nRes: “, $id;
}
This is a very simple implementation of Plucene. However it is highly scalable and many more complex applications can be built to make use of Plucene.