Wordcnt.c, an ANSI-C utility to find word occurrences in any type of file
(both text and binary)
This page describes a utility that I
wrote for Zax in a newsgroup called alt.privacy.anon-server.
Zax had been experimenting with a free newsserver for APAS for some time
when he asked in alt.fan.cult-dead-cow for a utility to find frequent words
in text files. There seemed to be few, some worse than others and most
payware. Since I liked Zax a lot (compared to the average usenet dweller
I troll along), and I had four years of computer and software related education
in my pocket, I decided to wield the omnipotent Turbo C 2.01 compiler from Borland once again and start making this world
a better place for both software and computer hardware.
So here it is: wordcnt.
Some of the design principles I used were:
- no memory allocation
- no unchecked buffers (see the sketch after this list)
- no errors that the user wouldn't understand
- no nagging the user with useless information
- not being picky about the input to this program
- no analytical tools that other programs (like Excel or Open Office) would always excel at
- an easy to use and understand data format
- no crashing, core dumping or out of memory errors that would only annoy the user
- no assumptions on the input, other than that it was usenet generated (IOW, it could be anything)
- no gulping up system resources in a multi-tasking environment, so the program could be run from a script on a (usenet) server
- no lack of comments that would make my code hard to understand
- no code that I didn't understand, or code that would make the program run a lot slower at no particular gain
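As a small illustration of what 'no memory allocation' and 'no unchecked buffers' look like in practice, here is a sketch of a bounded word reader. It is my own illustration in the spirit of the program, not the actual wordcnt source; the name next_word and the buffer size of 64 are made up.

#include <stdio.h>
#include <ctype.h>

#define MAX_WORD 64   /* hypothetical buffer size, not from wordcnt */

/* Read the next word from fp into buf. Words longer than
   MAX_WORD - 1 characters are truncated, never overflowed.
   Returns the word length, or -1 at end of file. */
int next_word(FILE *fp, char buf[MAX_WORD])
{
    int c, len = 0;

    do {                           /* skip non-letter bytes, so binary
                                      files are handled like any other */
        c = fgetc(fp);
        if (c == EOF)
            return -1;
    } while (!isalpha(c));

    while (isalpha(c)) {           /* isalpha(EOF) is 0, ending the word */
        if (len < MAX_WORD - 1)
            buf[len++] = (char)c;  /* every write is bounds-checked */
        c = fgetc(fp);
    }
    buf[len] = '\0';
    return len;
}

Because the buffer has a fixed size and every write is checked, a pathological input can truncate a word but never overrun memory.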
So those are a lot of features not common in other software,
but what does the program actually do? Well, it counts words
from a file and spits out the results in O(n^2) time and O(n) memory (each
registered word takes up 8 bytes: a four-byte checksum and one big
four-byte counter). In the 16-bit MS-DOS executable n is capped at about
7000 unique words, and the 32-bit Windows executable reserves
about 200000 of these 64-bit structures. That is, it will never use more
memory and never go beyond O(n^2) complexity, so these are definite constraints
that will always be met. Some people might call this 'mission-critical'
programming, but I think this program goes further, since it won't ever
crash or behave unexpectedly either.
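To make that memory layout concrete, here is a minimal sketch of such a fixed-size table, assuming unsigned long is four bytes as it is on both of these platforms. The names entry and register_word, and what happens when the table fills up, are my own guesses, not the actual wordcnt code.

#define MAX_WORDS 7000   /* roughly the cap of the 16-bit build */

struct entry {
    unsigned long checksum;   /* four-byte checksum of the word */
    unsigned long count;      /* four-byte occurrence counter   */
};

static struct entry table[MAX_WORDS];   /* static array: no malloc() */
static long used = 0;

/* Count one occurrence of the word behind this checksum.
   The linear scan is O(n) per word, hence O(n^2) overall. */
void register_word(unsigned long checksum)
{
    long i;

    for (i = 0; i < used; i++) {
        if (table[i].checksum == checksum) {
            table[i].count++;
            return;
        }
    }
    if (used < MAX_WORDS) {   /* when full, new words are simply
                                 ignored, so the bound always holds */
        table[used].checksum = checksum;
        table[used].count = 1;
        used++;
    }
}

The static array is what guarantees the memory bound: the table can fill up, but the program can never run out of memory at run time.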
Since the code is statistically optimized (tuned for random input in which
some words repeat), it should actually run a lot faster than O(n^2)
most of the time. I heard it outperformed an ANSI C++ red-black tree library
(whatever a red-black tree might be). That is due to statistics, of course;
for the worst-case situation a weighted binary tree with some hash table properties
seems like a good choice to me. That extra speed would come at a price,
however, a price I am not willing to pay for this program since it seems
fast enough for the purpose it was written for (see above).
Since you are still reading this, you are probably not a very accomplished
programmer yet (or you would have gotten to the code already), so here
are some tips on how to use this program in a Windows environment.
You can give three command line parameters to wordcnt.exe and wordcnt32.exe.
Giving none will show a short description of the program. The first
parameter must be the name of an existing file; I am sorry that the error
messages you get when opening this file fails won't be of much use to most
users. This program was not really written with novice users in mind, yet
as far as I can see it will take all the abuse a user is able to throw at it.
If you specify a second parameter, it needs to be a number: the minimum
length of the words you are looking for. You should always specify this,
as it is a major speedup variable for the program. Let me explain: if I
only count larger words, I don't have to keep track of smaller words, so
n is smaller in O(n^2).
If you specify a third parameter, this also needs to be a number. This
number also speeds up the program, by dropping all the words with lower
counts before printing them. Because of the way I have implemented this
utility, execution time could theoretically be cut in half if you set this
number as high as is permissible. In the situation of Zax, whom I coded
this program for, I could imagine he would set this number to 100 or
somewhere in that area: we are trying to tag a dictionary flooder on usenet.
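To make that concrete: a hypothetical invocation for Zax's situation could be "wordcnt32 spool.txt 6 100" (the filename and the minimum length of six are made up for the example), which would only track words of at least six characters and only print the ones counted 100 times or more.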
So that is how the program works on its own. There are some command
line utilities shipped with Windows that will make your life a lot easier:
- You can redirect the output of this program to a file like this: "wordcnt wordcnt.zip 1 1 > myfile.txt". This also works in Linux (which I think it was stolen from).
- You can sort the output of this program with a utility called sort, like this: "sort /+10 myfile.txt > mysorted.txt". This will sort the file alphabetically. You can also sort on count (which I would deem more useful); see the example after this list.
- You can edit the output of this program with a utility called edit, like this: "edit mysorted.txt".
So there you have it, that is all I have to say about this program. I hope
it will be supported by a lot of people in the future and that this program
will lead a long and happy life. As a final thought about programming: "Programs
don't crash computers, programmers and designers crash computers".
-- Don't believe the hype (Chuck D).