inverted index java coding needed
Basic concept
The inverted index is a tool that maps occurrences of data (in our case, words) back to locations in a structure. This is frequently used to support full text support for information stored in databases. It is typically used to return records or documents that contain the queried word. We are going to use it to return lines of text.
There are a number of ways that we can implement the inverted index — it basically just requires an associative structure. Of course, we will be implementing it with a hash map. The keys in this case will be words. The values will be a list of the lines in which that word occurred. When a user queries the index with a word, she will be able to see all of the lines containing the queried word.
Part one: The hash map [15 pts]
The first thing you will need to implement is a hash map in a class called HashMap
. This should fullyimplement
the Map interface we have been working with. The storage within the HashMap
should be provided by an array, and you should use external chaining to resolve collisions. For a hashing function, use thehashCode()
method that is already defined for all Objects
.
In addition to the methods required by the Map
interface, you should implement the following additional methods on the HashMap
:
-
public HashMap(int size)
- This is the constructor. The size variable determines the number of slots in the underlying hash map.
-
public int getMaxChainLength()
-
This method should return the length of the longest chain in the
HashMap
. -
public double getLoadFactor()
-
This method should return the load factor of the
HashMap
.
Part two: Building the index [35 pts]
The index should be implemented in a new class called InvertedIndex
. The interface of this class should look like this:
-
public InvertedIndex(File file)
-
The constructor of the class. This reads in a
File
and builds the index. -
public InvertedIndex(File file, int tableSize)
- This performs the same function as above, but allows the size of the hash table to be specified (the first constructor should just call this one with a default value).
-
public Iterator<Integer> getLineNumberIterator(String word)
-
Returns an
Iterator
that walks through each line number associated with a particular word. The
Iterator
does not need to implement
remove
. -
public Iterator<String> getHighlightedLines(String word)
-
Returns an
Iterator
that returns Strings containing the line number, followed by a colon, followed by a line containing at least one instance of the word. Each instance of the word should be surrounded by # characters. This should also not implement remove. -
public int numOccurrences(String word)
- Return the number of occurrences of the word in the document.
The constructor will take the file and store it, line by line, in an ArrayList (use the one in java.util
). This is just a copy of the document that we can easily access by line number. The HashTable
will be used to maintain our inverted index, which will map words (the keys) to occurrences in the document. We want to store both line number and word number within the line. You should create a simple wrapper class for holding these two pieces of information. Most words will occur more than once in a document, so the “value” stored in the hash table will be a List
of the small wrapper objects. Use the LinkedList
class fromjava.util
for this purpose. As an implementation detail, convert all of the words to lowercase before you store them in the hash table, this will allow for case insensitive searching.
Part three: An application [10 pts]
Add a main
method to your InvertedIndex
class. This should take two arguments on the command line: the first should be a number which should determine the number of slots in the hash table, and the second will be a file name. You will, of course, use the file and size to build an InvertedIndex
instance. Print out the two hash table statistics available through the methods specified earlier. You should add some methods to the InvertedIndex
to allow acces to these.
Once the statistics are printed out, your application should enter an interactive mode. When the user enters a word, your index should look it up and return the lines that contain it. Each line will be printed out in the following format (the < and > just indicate that they are fields being described, they shouldn’t be included in the actual output):
<line number>: <text of the line with all instances of the word surrounded by the # character.>
If there are no matching lines, there should be an appropriate error message. When the user uses theRETURN
key without entering any text, the application should quit.