Wordcnt.c, an ANSI-C utility to find word occurrences in any type of file
(both text and binary)
This page describes a utility that I
wrote for Zax in a newsgroup called alt.privacy.anon-server.
Zax had been experimenting with a free newsserver for APAS for some time
when he asked in alt.fan.cult-dead-cow for a utility to find frequent words
in text files. There seemed to be few, some worse than others and most
payware. Since I liked Zax a lot (compared to the average usenet dweller
I troll along), and I had four years of computer and software related education
in my pocket, I decided to wield the omnipotent Turbo C 2.01 compiler from Borland once again and start making this world
a better place for both software and computer hardware.
So here it is: wordcnt.
Some of the design principles I used were:
- no memory allocation
- no unchecked buffers (see the sketch after this list)
- no errors that the user wouldn't understand
- no nagging the user with useless information
- not being picky about the input to this program
- no analytical tools that other programs (like Excel or Open Office) would always excel at
- an easy to use and understand data format
- no crashing, core dumping or out of memory errors that would only annoy the user
- no assumptions on the input, other than that it was usenet generated (IOW, it could be anything)
- no gulping up system resources in a multi-tasking environment, so the program could be run from a script on a (usenet) server
- no lack of comments that would make my code hard to understand
- no code that I didn't understand, or code that would make the program run a lot slower at no particular gain
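As a small illustration of what 'no memory allocation' and 'no unchecked buffers' look like in practice, here is a sketch of a bounded word reader. It is my own illustration in the spirit of the program, not the actual wordcnt source; the name next_word and the buffer size of 64 are made up.

#include <stdio.h>
#include <ctype.h>

#define MAX_WORD 64   /* hypothetical buffer size, not from wordcnt */

/* Read the next word from fp into buf. Words longer than
   MAX_WORD - 1 characters are truncated, never overflowed.
   Returns the word length, or -1 at end of file. */
int next_word(FILE *fp, char buf[MAX_WORD])
{
    int c, len = 0;

    do {                           /* skip non-letter bytes, so binary
                                      files are handled like any other */
        c = fgetc(fp);
        if (c == EOF)
            return -1;
    } while (!isalpha(c));

    while (isalpha(c)) {           /* isalpha(EOF) is 0, ending the word */
        if (len < MAX_WORD - 1)
            buf[len++] = (char)c;  /* every write is bounds-checked */
        c = fgetc(fp);
    }
    buf[len] = '\0';
    return len;
}

Because the buffer has a fixed size and every write is checked, a pathological input can truncate a word but never overrun memory.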
So those are a lot of features not common in other software,
but what does the program actually do? Well, it counts words
from a file and spits out the results in O(n^2) time and O(n) memory (each
registered word takes up 8 bytes: a four-byte checksum and one big
four-byte counter). In the 16-bit MS-DOS executable n is capped at about
7000 unique words, and the 32-bit Windows executable reserves
about 200000 of these 64-bit structures. That is, it will never use more
memory and never go beyond O(n^2) complexity, so these are definite constraints
that will always be met. Some people might call this 'mission-critical'
programming, but I think this program goes further, since it won't ever
crash or behave unexpectedly either.
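To make that memory layout concrete, here is a minimal sketch of such a fixed-size table, assuming unsigned long is four bytes as it is on both of these platforms. The names entry and register_word, and what happens when the table fills up, are my own guesses, not the actual wordcnt code.

#define MAX_WORDS 7000   /* roughly the cap of the 16-bit build */

struct entry {
    unsigned long checksum;   /* four-byte checksum of the word */
    unsigned long count;      /* four-byte occurrence counter   */
};

static struct entry table[MAX_WORDS];   /* static array: no malloc() */
static long used = 0;

/* Count one occurrence of the word behind this checksum.
   The linear scan is O(n) per word, hence O(n^2) overall. */
void register_word(unsigned long checksum)
{
    long i;

    for (i = 0; i < used; i++) {
        if (table[i].checksum == checksum) {
            table[i].count++;
            return;
        }
    }
    if (used < MAX_WORDS) {   /* when full, new words are simply
                                 ignored, so the bound always holds */
        table[used].checksum = checksum;
        table[used].count = 1;
        used++;
    }
}

The static array is what guarantees the memory bound: the table can fill up, but the program can never run out of memory at run time.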
Since the code is statistically optimized (tuned for random input in which
some words repeat), it should actually run a lot faster than O(n^2)
most of the time. I heard it outperformed an ANSI C++ red-black tree library
(whatever a red-black tree might be). That is due to statistics, of course;
for the worst-case situation a weighted binary tree with some hash table properties
seems like a good choice to me. That extra speed would come at a price,
however, a price I am not willing to pay for this program since it seems
fast enough for the purpose it was written for (see above).
Since you are still reading this, you are probably not a very accomplished
programmer yet (or you would have gotten to the code already), so here
are some tips on how to use this program in a Windows environment.
You can give three command line parameters to wordcnt.exe and wordcnt32.exe.
Giving none will show a short description of the program. The first
parameter must be the name of an existing file; I am sorry that the error
messages you get when opening this file fails won't be of much use to most
users. This program was not really written with novice users in mind, yet
as far as I can see it will take all the abuse a user is able to throw at it.
If you specify a second parameter, it needs to be a number: the minimum
length of the words you are looking for. You should always specify this,
as it is a major speedup variable for the program. Let me explain: if I
only count larger words, I don't have to keep track of smaller words, so
n is smaller in O(n^2).
If you specify a third parameter, this also needs to be a number. This
number also speeds up the program, by dropping all the words with lower
counts before printing them. Because of the way I have implemented this
utility, execution time could theoretically be cut in half if you set this
number as high as is permissible. In the situation of Zax, whom I coded
this program for, I could imagine he would set this number to 100 or
somewhere in that area: we are trying to tag a dictionary flooder on usenet.
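To make that concrete: a hypothetical invocation for Zax's situation could be "wordcnt32 spool.txt 6 100" (the filename and the minimum length of six are made up for the example), which would only track words of at least six characters and only print the ones counted 100 times or more.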
So that is how the program works on its own. There are some command
line utilities shipped with Windows that will make your life a lot easier:
- You can redirect the output of this program to a file like this: "wordcnt wordcnt.zip 1 1 > myfile.txt". This also works in Linux (which I think it was stolen from).
- You can sort the output of this program with a utility called sort, like this: "sort /+10 myfile.txt > mysorted.txt". This will sort the file alphabetically. You can also sort on count (which I would deem more useful); see the example after this list.
- You can edit the output of this program with a utility called edit, like this: "edit mysorted.txt".
So there you have it, that is all I have to say about this program. I hope
it will be supported by a lot of people in the future and that this program
will lead a long and happy life. As a final thought about programming: "Programs
don't crash computers, programmers and designers crash computers".
-- Don't believe the hype (Chuck D).