Advertisements
Home > Information Technology, programming > Counting Word Occurrence in a TextFile Using Efficient Storage For Faster Retrieval

Counting Word Occurrence in a TextFile Using Efficient Storage For Faster Retrieval

The previous day I applied for a position as a JAVA developer for Harita Mineral. I passed my second try in logic test after failing it in the first attempt. So, the next step was the technical test, where I was being judged on how I solve a problem by coding the real deal. The technical test for the company I applied was:

Goal of the Technical Test:
We are to see how you organize your thought, construct your code in accordance to good Object Oriented Programming’s Practice, and how you usedata structure to store your data or search algorithm (if you use any) within the given problem.

Test:
You are to write a small Java Program to read and parse words from any given html file then store the words into the memory. User can search any words that he/she wishes to and the program will find the words from the memory and output the result of the search. Read the requirement below for more details.

Requirement:
1. User can enter file that they want the program to read. The file must be in the same directory as the class file.
2. User can enter word to be search as many times as they want
3. User can enter numerical number to give sign to program that they want to exit
4. Program will read and parse words in the given filename by user and store them in memory to search (output the measure time of reading and parsing words)
5. Program will output the number of occurrence of the word that users want to search if the word search exist in memory
6. Program will output the number of milliseconds needed to search the given word.
7. No database usage allowed.
8. Only 1 java file allowed
9. Only JDK Standard Edition (1.6) with standard API you can use to construct the program. (read as = no outside jars of standard issue JDK allowed)
10. Write the program within maximum 150 lines of code (don’t use any comment)

———
Example of Expected Output for the program:

C:>java RunTest
enter filename to read => overview-summary.html (user presses ENTER)
Total unique words = 1000 words – it takes 170 ms
enter word to search => java (user presses ENTER)
Found word java with total 20 occurrences in file overview-summary.html – it takes 90 ms
enter word to search => xml (user presses ENTER)
Found word xml with total 5 occurrences in file overview-summary.html – it takes 100 ms
enter word to search => harita (user presses ENTER)
Not found word harita in file overview-summary.html – it takes 200 ms
enter word to search => 1 (user presses ENTER)
Thank you for using RunTest
C:>

You have 3 days to submit your Java code starting from today.
You may ask questions to clarify any of the instructions written here.

So, to tackle this problem effectively we need to elaborate it into small pieces of tasks:

1. Deciding the correct library to use

We need a function to read input from keyboard as well as reading input from a file. Therefore, we need an input/ output library to handle this function which resides in the java.io. Next is to break the text we acquire from the defined file to only accept words. StreamTokenizer is a fast way to do this since you can define what characters to be a word, number, or symbol. StreamTokenizer also resides in java.io. Now, to store the words into a storage, we can use the library from java.util. The choices are many such as ArrayList, LinkedList, HashTable, HashMap, and so on. I personally choose HashMap since it can promote faster retrieval and element storing. I’ll be explaining more on this later.

So, in the first step, we got ourselves the code like this:

import java.io.*;
import java.util.*;

public class WordCount {
   public static void main (String[] args){

   }
}

2. Reading user input for what file to be read

Next, comes the problem to read user’s keyboard input. We need System.in to access the input from the command line and a converter called InputStreamReader to convert the input into readable format for JAVA and BufferedReader to convert it further to human readable format by calling the readline function in the BufferedReader class. Oh yeah, we need to catch an exception in accessing the readline function so you need a try method for that.

Anyway, here’s the code that you should have now:

public class WordCount {
   public static void main (String[] args){
      String inputText = "";
      System.out.print("Enter a filename to read: ");
      InputStreamReader converter = 
          new InputStreamReader(System.in);
      BufferedReader in = new BufferedReader(converter);
      try {
         inputText = in.readLine();
      }

      catch(Exception e)  {
         System.out.println(""+e);
      }
   }
}

3. Reading the user-defined textFile

This is an easy one. You can use FileReader to access the specific file. However, you might want to also store it in the BufferedReader for the usage of StreamTokenizer class for defining delimiter to break down the sentence into words.

Therefore, the code now should be like this:

public class WordCount {
   public static void main (String[] args){
      String inputText = "";
      System.out.print("Enter a filename to read: ");
      InputStreamReader converter = 
            new InputStreamReader(System.in);
      BufferedReader in = new BufferedReader(converter);
      try {
         inputText = in.readLine();
         BufferedReader inputFile = 
             new BufferedReader(new FileReader(inputText));
      }

      catch(Exception e)  {
         System.out.println(""+e);
      }
   }
}

4. Breaking down the sentence into words

This is one of the core function that need’s to be sorted out, so read carefully. StreamTokenizer class provide a handy amount of features to store our input. The first you want to do is to put the file that has been read by the system in the BufferedReader to a newly invoked StreamTokenizer class. Next is to define what characters you want to consider as words and be stored in a public variable of StreamTokenizer called “sval”. To do this, reset all the characters by using resetSyntax function to put all the characters to ordinary characters. Then, use wordChars function to define which characters to be words. In this case, usewordChars(‘A’, ‘Z’) and wordChars(‘a’,’z’) to define all words to be alphabetic only. Now, to iterate all the words that has been stored, you can use nextToken function and check whether the token is a word by checking whether the value of token is StreamTokenizer.TT_WORD (constant type to indicate that the token is a word).

You now should have the code like this:

public class WordCount {
   public static void main (String[] args){
      String inputText = "";
      System.out.print("Enter a filename to read: ");
      InputStreamReader converter = new InputStreamReader(System.in);
      BufferedReader in = new BufferedReader(converter);
      try {
         inputText = in.readLine();
         BufferedReader inputFile = 
              new BufferedReader(new FileReader(inputText));
         StreamTokenizer st = new StreamTokenizer(inputFile);
         st.resetSyntax();
         st.wordChars('A','Z');
         st.wordChars('a','z');

         int found = StreamTokenizer.TT_WORD;
         String keyBuffer = "";
         while ((found != StreamTokenizer.TT_EOF))   {
            found = st.nextToken ();
            if(found == StreamTokenizer.TT_WORD) {
               String token = st.sval;
            }
         }
      }

      catch(Exception e)  {
         System.out.println(""+e);
      }
   }
}

5. Storing the retrieved word into HashMap

We’re using HashMap because of the easiness and efficiency usage of this particular storage. HashMap introduce the concept of key and value, where to access a value that has been stored into the HashMap, you need a certain key value. Think of it as a door, where if you want to access to what inside the door, you’ll need a certain key to open the door. Therefore, we can use the word as the key and the number of occurence of that particular word as the value to the corresponding key.

Now, we need to always check whether the key (the word) has already stored in the HashMap. We can use the get function in the HashMap class to check whether key has already been registered to the HashMap. If the get function returned null, it means that it is a new word. Use the put function to store that word as the key and “1″ as the number of occurence of the value of the corresponding value.

On the other hand, if the get function do not return null, that means that the key has already been registered to the HashMap. Therefore we need to override the value of that particular key by using the put function. The brilliant thing about HashMap that it only allows unique keys to be registered. Any put function to an existing key will resulted in the value of the key to be overwrite.

Now, the code you have been doing so far, should be like this:

public class WordCount {
   public static void main (String[] args){
      String inputText = "";
      HashMap<String,Integer> wordTank = 
             new HashMap<String,Integer>();
      System.out.print("Enter a filename to read: ");
      InputStreamReader converter = 
             new InputStreamReader(System.in);
      BufferedReader in = new BufferedReader(converter);
      try {
         inputText = in.readLine();
         BufferedReader inputFile = 
             new BufferedReader(new FileReader(inputText));
         StreamTokenizer st = new StreamTokenizer(inputFile);
         st.resetSyntax();
         st.wordChars('A','Z');
         st.wordChars('a','z');

         int found = StreamTokenizer.TT_WORD;
         String keyBuffer = "";
         while ((found != StreamTokenizer.TT_EOF))   {
            found = st.nextToken ();
            if(found == StreamTokenizer.TT_WORD) {
               String token = st.sval;
               if(wordTank.get(token) == null) {
                  wordTank.put(token,1);
               }
               else {
                  wordTank.put(token,wordTank.get(token)+1);
               }
            }
         }
         System.out.println("Total Unique words: " + wordTank.size());
      }

      catch(Exception e)  {
         System.out.println(""+e);
      }
   }
}

6. Retrieving a word from user input

The next requirement is to provide a search feature to check whether a word that is typed by a user exists in the storage. If it exists, display the occurrence of that particular word. By now you should’ve guess that we need another readLine function to read user’s keyboard input. Since the search feature is repeated until the user types a numeric number to quit, put the next line code on an infinite loop and parse the user input into integer by using Integer.parseInt(string variable) function. If the input is a number, the system quits, while non-numeric will bring you to the retrievement of the word that is inputted by the user.

To retrieve the word in the storage, use the usual get function in the HashMap. If it returns null, that means the word that is being searched is not there and vice versa. So far, you should have this code now:

public class WordCount {
   public static void main (String[] args){
      String inputText = "";
      HashMap<String,Integer> wordTank = 
         new HashMap<String,Integer>();
      System.out.print("Enter a filename to read: ");
      InputStreamReader converter = 
         new InputStreamReader(System.in);
      BufferedReader in = new BufferedReader(converter);
      try {
         inputText = in.readLine();
         BufferedReader inputFile = 
            new BufferedReader(new FileReader(inputText));
         StreamTokenizer st = new StreamTokenizer(inputFile);
         st.resetSyntax();
         st.wordChars('A','Z');
         st.wordChars('a','z');

         int found = StreamTokenizer.TT_WORD;
         String keyBuffer = "";
         while ((found != StreamTokenizer.TT_EOF))   {
            found = st.nextToken ();
            if(found == StreamTokenizer.TT_WORD) {
               String token = st.sval;
               if(wordTank.get(token) == null) {
                  wordTank.put(token,1);
               }
               else {
                  wordTank.put(token,wordTank.get(token)+1);
               }
            }
         }
         System.out.println("Total Unique words: " + wordTank.size());
         Integer occurrence;
         while(true) {
            System.out.print("Enter word to search => " );
            inputText = in.readLine();
            try {
               Integer.parseInt(inputText);
               System.out.println("Thanks for using WordCount");
               break;
            }
            catch(Exception e) {}
            occurrence = wordTank.get(inputText);
               if(occurrence != null) {
                  System.out.println("Found word " + inputText + 
                      " with total occurrences of " +
                      occurrence);
               }

               else{
                  System.out.println("Can't found word " + 
                     inputText + " in the file.");
               }
          }
      }

      catch(Exception e)  {
         System.out.println(""+e);
      }
   }
}

7. Calculate your algorithm performance

Now, you need to test whether HashMap and StreamTokenizer can provide you faster storing and retrieving element in an array-like  structure. Use System.currentMillis() to get the current millisecond and put it between the start and the end of coding block that you want to analyze.

Now, the final code should be like this:

public class WordCount {
   public static void main (String[] args){
      String inputText = "";
      HashMap<String,Integer> wordTank = 
          new HashMap<String,Integer>();
      System.out.print("Enter a filename to read: ");
      InputStreamReader converter = 
          new InputStreamReader(System.in);
      BufferedReader in = new BufferedReader(converter);
      try {
         inputText = in.readLine();
         long startTime = System.currentTimeMillis();
         long endTime;
         BufferedReader inputFile = 
             new BufferedReader(new FileReader(inputText));
         StreamTokenizer st = new StreamTokenizer(inputFile);
         st.resetSyntax();
         st.wordChars('A','Z');
         st.wordChars('a','z');

         int found = StreamTokenizer.TT_WORD;
         String keyBuffer = "";
         while ((found != StreamTokenizer.TT_EOF))   {
            found = st.nextToken ();
            if(found == StreamTokenizer.TT_WORD) {
               String token = st.sval;
               if(wordTank.get(token) == null) {
                  wordTank.put(token,1);
               }
               else {
                  wordTank.put(token,wordTank.get(token)+1);
               }
            }
         }
         System.out.println("Total Unique words: " + 
            wordTank.size() + " it takes " +
            (System.currentTimeMillis()-startTime) + " ms\n");
         Integer occurrence;
         while(true) {
            System.out.print("Enter word to search => " );
            inputText = in.readLine();
            try {
               Integer.parseInt(inputText);
               System.out.println("Thanks for using WordCount");
               break;
            }
            catch(Exception e) {}
            startTime = System.currentTimeMillis();
            occurrence = wordTank.get(inputText);
            endTime = System.currentTimeMillis();
               if(occurrence != null) {
                  System.out.println("Found word " + 
                     inputText + " with total occurrences of " +
                     occurrence + " in the file. It takes " + 
                     (endTime-startTime) + " ms.");
               }

               else{
                  System.out.println("Can't found word " + 
                     inputText + " in the file. It takes " +
                     (endTime-startTime) + " ms.");
               }
          }
      }

      catch(Exception e)  {
         System.out.println(""+e);
      }
   }
}

Test that with your desired text file. How fast did the algorithm perform in compare to using standard array and manual tokenizing?

Advertisements
  1. May 25, 2015 at 11:36 am

    Hello there I am so glad I found your blog, I really found
    you by error, while I was browsing on Bing for something else, Anyways
    I am here now and would just like to say cheers for a marvelous post
    and a all round interesting blog (I also love the theme/design), I don’t have time to look
    over it all at the minute but I have book-marked it and also added in your RSS feeds, so when I have time I will be back to read much more, Please do keep up the excellent work.

  2. July 1, 2015 at 7:25 am

    I don’t drop a ton of responses, however i did some
    searching and wound up here Counting Word Occurrence in a TextFile Using Efficient Storage For Faster Retrieval | Knowledge Junk.
    And I do have a few questions for you if it’s allright.
    Could it be only me or does it look like some of the comments look
    as if they are left by brain dead people? 😛 And, if you are writing on additional
    online sites, I’d like to keep up with everything fresh you have to post.
    Would you make a list of every one of your shared sites
    like your Facebook page, twitter feed, or linkedin profile?

  3. July 4, 2015 at 6:40 am

    Wonderful, what a website it is! This web site presents helpful facts to us,
    keep it up.

  4. October 6, 2015 at 6:30 am

    Unquestionably consider that which you said. Your favorite reason seemed
    to be at the web the simplest factor to be mindful of.
    I say to you, I definitely get annoyed whilst other people consider issues that they just do not recognize about.
    You managed to hit the nail upon the top as neatly as
    outlined out the entire thing with no need side-effects , people
    could take a signal. Will probably be again to
    get more. Thank you

  1. July 19, 2015 at 11:54 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

%d bloggers like this: