spell check - how to?
Robert Huff
roberthuff at rcn.com
Fri Nov 7 13:43:27 UTC 2008
Giorgos Keramidas writes:
> The main drawback of being unable to use the `freebsd' wordlist
> is that you will get many false positives for words that are
> perfectly valid for FreeBSD documentation but are not standard
> English words.
I have a script which does something similar, using ispell.
It's based on the Perl script (found online) appended below.
I pseudo-fixed the false-positive problem by running the output
through sort and starting with the least frequent hits.
Attempts to build a project-specific dictionary proved too
confusing, and it was ultimately not worth the effort.
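A minimal sketch of the sort step described above, for illustration only (this is not the actual script). It assumes the suspect words arrive one per line on standard input, for example from `ispell -l`; the function name `least_frequent_first` and the file name `chapter.sgml` are made up.

```shell
#!/bin/sh
# Hypothetical helper: reads one word per line on stdin and prints
# "count word" lines, with the rarest words first.
least_frequent_first() {
    # Group identical words, count each group, then sort by count ascending.
    sort | uniq -c | sort -n
}

# Example use (assumes ispell is installed; chapter.sgml is a made-up name):
#   ispell -l < chapter.sgml | least_frequent_first
```

The idea, presumably, is that words flagged only once or twice are more likely to be real typos, while project-specific terms recur often and sink to the bottom of the list.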
Robert Huff
#!/usr/local/bin/perl -W
# WordFreq.pl -- Count word frequency in a text file
$ver = "v1.0"; # 05-Dec-2001 JP Vossen <jp at jpsdomain.org>
# Basics from 8.3, page 280 of _Perl_Cookbook_
# Added stop words
(($myname = $0) =~ s/^.*(\/|\\)|\..*$//ig); # remove up to last "\" or "/" and after any "."
$Greeting = ("$myname $ver Copyright 2001 JP Vossen (http://www.jpsdomain.org/)\n");
$Greeting .= (" Licensed under the GNU GENERAL PUBLIC LICENSE:\n");
$Greeting .= (" See http://www.gnu.org/copyleft/gpl.html for full text and details.\n"); # Version and copyright info
%seen = (); # Create the hash
# Define the stopwords
@stopwords = ("a", "an", "and", "are", "as", "at", "be", "but", "by",
"does", "for", "from", "had", "have", "her", "his", "if", "in", "is",
"it", "not", "of", "on", "or", "that", "the", "this", "to", "was",
"which", "with", "you");
if (("@ARGV" =~ /\?/) || (@ARGV > 5)) { # if too many args, or a ? in args - die
    print STDERR ("\n$Greeting\n\tUsage: $myname [-i {infile}] [-o {outfile}] [-s]\n");
    print STDERR ("\nIf -i or -o is omitted, STDIN or STDOUT is used.\n");
    print STDERR ("If -s is used, the list of stop words will NOT be used.\n");
    print STDERR ("The stopwords currently defined are:\n\n ");
    foreach $stopword (@stopwords) {
        print STDERR ("$stopword ");
    } # end of foreach stopword
    die ("\n");
}
use Getopt::Std; # Use Perl5 built-in program argument handler
getopts('i:o:s'); # Define possible args.
if (! $opt_i) { $opt_i = "-"; } # If no input file specified, use STDIN
if (! $opt_o) { $opt_o = "-"; } # If no output file specified, use STDOUT
open (INFILE, "$opt_i") || die "$myname: error opening $opt_i $!\n";
open (OUTFILE, ">$opt_o") || die "$myname: error opening $opt_o $!\n";
print STDERR ("\n$Greeting\n");
while (<INFILE>) { # Read the input file
    while ( /(\w['\w-]*)/g ) { # If we have a "word"
        $seen{lc $1}++; # Count it in the hash
    } # end of while words
} # end of while input
# Removing stopwords after counting means we only test once for each
# stopword, rather than in a loop!
if (! $opt_s) { # If we're using stopwords
    foreach $stopword (@stopwords) { # for each stopword
        delete($seen{$stopword}); # Remove it from the hash
    } # end of foreach stopword
} # end of if using stopwords
# Print the results, sorted most frequent words at the top
foreach $word ( sort { $seen{$b} <=> $seen{$a} } keys %seen) {
    printf OUTFILE ("%6d %s\n", $seen{$word}, $word);
} # end of foreach word