Wednesday, January 6, 2010

Anagram program implentation

#include 
using namespace std;

#include "anaword.h"

static const int ALPH_SIZE = 26;

Anaword::Anaword(const string & word)
: myWord(word),
myCounts(ALPH_SIZE,0)
// postcondition: constructed
{
normalize();
}

Anaword::Anaword()
: myWord(""),
myCounts(ALPH_SIZE,0)
{

}

void Anaword::normalize()
// postcondition: myCounts represents the letter signture of myWord
{

}

string Anaword::toString() const
// postcondition: return "bagel" or "gable", regular form of string
{
return myWord;
}

ostream & operator << (ostream & out, const Anaword & a)
// postcondition: a printed t stream out, out returned
{
out << a.toString();
return out;
}

bool Anaword::equal(const Anaword & rhs) const
// postcondition: returns true if and only if *this == rhs
// canonical/normalized form of word used for comparisons
{
return false;
}

bool operator == (const Anaword & lhs, const Anaword & rhs)
// postcondition: returns true if and only if lhs == rhs
{
return lhs.equal(rhs);
}

bool operator != (const Anaword & lhs, const Anaword & rhs)
// postcondition: returns true if and only if lhs != rhs
{
return ! lhs.equal(rhs);
}


bool Anaword::less(const Anaword & rhs) const
// postcondition: returns true if and only if *this < rhs
// canonical/normalized form of word used for comparison
{
return false;
}

bool operator < (const Anaword & lhs,const Anaword & rhs)
// postcondition: returns true if and only if *this < rhs
{
return lhs.less(rhs);
}

bool operator <= (const Anaword & lhs,const Anaword & rhs)
// postcondition: returns true if and only if *this <= rhs
{
return ! rhs.less(lhs);
}

Anagram class

#ifndef _ANAWORD_H
#define _ANAWORD_H

#include
#include
using namespace std;
#include "tvector.h"

// Used for finding anagrams: words with the same letters
// but which are different words, e.g., "bagel" a "gable"
// author: Owen Astrachan
//
// an Anaword object prints as a regular string, but
// compares using a normalized (also called canonicalized) form
//
// Example: the Anaword version of the string "bagel"
// prints as bagle, but will be compared with
// other Anawords as a vector of counts
// of one 'a', one 'b', one 'e', one 'g', one 'l'
// Since the counts for "gable" are the same, "gable"
// and "bagel" will be equal when compared using operator ==
//
// basically an Anaword takes a string and converts it to a twenty-six
// digit number based on the counts of a's, b's, c's, ... z's so that
// aardvark is represented as:
//
// 3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0
//
//
// operations:
//
// Anaword(const string & word) -- construct from a string
//
// bool equal(const Anaword & rhs) -- compare rhs for equality
// bool operator == (lhs, rhs) -- compare Anawords lhs == rhs
//
// bool less(const Anaword & rhs) -- compare rhs for inequality <
// bool operator < (lhs,rhs) -- compare Anawords lhs < rhs
// bool operator <= (lhs,rhs) -- compare Anawords lhs <= rhs
//
// string toString() -- returns uncanonicalized "bagel"
// ostream & << operator(ostream, -- print using <<
// Anaword)

class Anaword
{
public:
Anaword(const string & word); // construct from string
Anaword(); // default (for vector)
bool equal(const Anaword & rhs) const; // compare for ==
bool less(const Anaword & rhs) const; // compare for <
string toString() const; // return "bagel" or "gable"

private:

void normalize(); // helper function, sorts
string myWord; // regular string: "bagel"
tvector myCounts; // canonicalized form
};

bool operator == (const Anaword & lhs, const Anaword & rhs);
bool operator != (const Anaword & lhs, const Anaword & rhs);
bool operator < (const Anaword & lhs, const Anaword & rhs);
bool operator <= (const Anaword & lhs, const Anaword & rhs);
ostream & operator << (ostream & out, const Anaword & a);

#endif 
 
#include 
#include
#include
using namespace std;

#include "anaword.h"
#include "prompt.h"
#include "tvector.h"
#include "sortall.h"



void FindAnagrams(tvector& list)
// pre: list contains list.size() elements
// post: all anagrams in list printed, one set of anagrams per line
{
QuickSort(list,list.size());
}

int main(int argc, char * argv[])
{
tvector list;
ifstream input;
string filename,word;
// use command line argument if it exists, else prompt user

if (argc > 1)
{
filename = argv[1];
}
else
{
filename = PromptString("enter file name ");
}

input.open(filename.c_str());
if (input.fail())
{
cerr << "could not open " << filename << endl;
exit(1);
}

while (input >> word)
{
list.push_back(Anaword(word));
}
cout << endl << "read " << list.size() << " words" << endl;

FindAnagrams(list);

return 0;
}
 
Do not run the program as it is (without implementing) Anaword::equal and Anaword::less since the call to QuickSort will cause an infinite loop.

Program Description

You'll write a program, part of which is given to you, that reads a file of words and generates as output all the anagrams in the file. Each line of output should contain words that are anagrams of each other, for example:
gazer graze gases sages heals leash shale heaps phase shape hares hears share shear earth hater heart haste hates heats haves shave arise raise lakes leaks Your program should deal with words that contain punctuation. You can ignore these words, but your program should not die/stop if such a word is encountered. You should make your own small test files, but check your final program with the input file words accessible on acpub as ~ola/data/words. (This file is /usr/dict/words from Linux.)

Coding and Algorithm

You must use the class Anaword whose declaration is given in the file anaword.h. You will need to write the implementation of this class in the file anaword.cpp although this has been started for you. An Anaword object is constructed from a string, and prints as the string, but is compared using a normalized or canonical form created by counting the number of times each letter in the word occurs (more on this below). For example, the code fragment below prints the two lines of output shown. &lt;br&gt; Anaword a("bagel");&lt;br&gt; Anaword b("gable");&lt;br&gt;&lt;br&gt; cout &amp;amp;lt;&amp;amp;lt; a &amp;amp;lt;&amp;amp;lt; " " &amp;amp;lt;&amp;amp;lt; b &amp;amp;lt;&amp;amp;lt; endl;&lt;br&gt; if (a == b) cout &amp;amp;lt;&amp;amp;lt; "they're ananagrams!" &amp;amp;lt;&amp;amp;lt; endl;&lt;br&gt; Output as shown:
bagel gable they're anagrams! The objects a and b are equal because the operator == is overloaded for Anaword objects and returns true for "bagel" and "gable" since both words have the same normalized/canonical form of letter-signature: 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 which indicates one 'a', one 'b', one 'e', one 'g', and one 'l'. The signature of aardvark, for example, is 3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 You must implement the member functions declared and described in anaword.h so that the real word (e.g., "bagel") is used for printing, but the canonical/normal form of the letter-signature is used for comparison using == and <
To do this you'll need to implement all the functions declared in anaword.hthat are not implemented in the anaword.cpp file you're given. This includes
  • Anaword::normalize which creates the signature for an Anaword
  • Anaword::equal which determines if two Anawords are equal
  • Anaword::less which determines if one Anaword is less than another.
For your program to run efficiently your implementations of Anaword::equal and Anaword::less should return true/false after making the minimal number of comparisons possible. Treat the vector myCounts as a 26-digit number so that, for example, "egg" > "ego" since the signatures are
egg: 00001020000000000000000000 ego: 00001010000000100000000000 The determination that "egg" is larger can be made after seven comparisons (the count of g's in egg is greater than the count of g's in ego and 'g' is the seventh letter of the alphabet).

A Working Program

You'll need to add code to doana.cpp to yield a working anagram program. This requires completing the function FindAnagrams using the idea below
  1. Sort the vector of Anaword objects using quicksort. You can do this with &lt;br&gt; QuickSort(list, list.size());&lt;br&gt; because Anaword ojbects can be compared using <, ==, and <=. Access to QuickSort is via #include "sortall.h".
  2. All anagrams are now adjacent in the vector. Make one pass over the elements looking for adjacent elements that are equal. You must write code to determine when two or more adjacent words are equal and print anagrams one set per line as shown at the beginning of this assignment.

Faster Anagrams

You'll now develop another method of canonicalizing an Anaword that's faster than the letter-signature method. First, when you've got the program working, you should add code to time how long it takes to find and print all anagrams in ~ola/data/words which is words, a file of 45,402 words. Write down the name of the machine you used and the time it takes and include these in your README file you submit. You should then create copies of both anaword.h and anaword.cpp by typing cp anaword.h anawordfinger.h cp anaword.cpp anawordfinger.cpp Now you have a copy of your (hopefully) correct Anaword class and implementation. You'll be re-implementing Anaword using another technique, but you'll need to submit both versions so you must make a copy. Instead of using the letter-signature method you'll store a sorted form of the word and use this to compare Anawords. For example, the sorted form of "bagel" and "gable" is the same, it's the string "abegl". This string is used for relational comparisons. To do this you'll need to make a few changes:
  1. Remove the declaration below from the private section of Anaword. &lt;br&gt; tvector&amp;amp;lt;int&amp;amp;gt; myCounts; // canonicalized form&lt;br&gt; Then you'll add a new declaration for the sorted word: &lt;br&gt; string mySortedWord; // canonicalized form&lt;br&gt;
  2. The function Anaword::equal is now one line: &lt;br&gt; return mySortedWord == rhs.mySortedWord;&lt;br&gt;
  3. The function Anaword::less is similar: &lt;br&gt; return mySortedWord &amp;amp;lt; rhs.mySortedWord;&lt;br&gt;
  4. The only other change needed is the function Anaword::normalize In this function you should sort the letters in mySortedWord which is a copy of myWord. To sort, use the code from selection sort which can be found in both the Sedgwick and Astrachan texts.
  5. Be sure to change the comments in anaword.h to reflect the new method used.
When the program works, time it again on the same machine you timed the original program. Write the time in the README file you create. You should also write a paragraph or explaining why you think the timings are different.

Grading

This assignment is worth 15 points. Style of the code/program counts for 5/15, correctness is 8/10 and the README is 2/10.

Submit

You'll submit a README that describes the timings of both versions of the program, includes information about how much time you spent on the assignment, and a list of people with whom you discussed the program. Submit all .h and .cpp files and the README file. You do not need to submit the Makefile unless you modified it.
To submit use
submit_cps100 anagram README *.h *.cpp If submit_cps100 doesn't work, try ~ola/bin/submit100

Extra Credit

This is worth four points. Instead of sorting using < and == for Anawords, create a different ordering so that when anagrams are printed shorter words are printed first and longer words printed last. To do this, create a function object as described in pages 543-549 of the Tapestry text. This object will be used when sorting the vector of Anawords, e.g., you'll write &lt;br&gt; AnaLenComparer analen;&lt;br&gt; ...&lt;br&gt; QuickSort(list,list.size(), analen);&lt;br&gt;
The struct/class AnaLenComparer should compare two Anaword objects, using the length of the strings in the objects as the first criteria of comparison, and using Anaword::less only if the strings have the same length. For example: &lt;br&gt; Anaword a("tear");&lt;br&gt; Anaword b("ace");&lt;br&gt; Anaword c("race");&lt;br&gt; Here we want b to be "less" than a since it has fewer characters, but c is less than a since although they have the same number of characters, the canonicalized forms show that "acer" < "aert".