· f

F#: Word Count - A somewhat failed attempt

I came across Zach Cox’s word count problem via Sam Aaron and Ola Bini’s twitter streams and I thought it’d be interesting to try it out in F# to see what the solution would be like.

The solution needs to count word frequencies from a selection of newsgroup articles.

I wanted to see if it was possible to write it in F# without using a map to keep track of how many of each word had been found.

My thinking was that I would need to keep all of the words found and then calculate the totals at the end.

After a bit of fiddling this is the version I ended up with:

word-count.fsx


#light
open System
open System.IO
open System.Text.RegularExpressions

let (|File|Directory|) path = if(Directory.Exists path) then Directory(path) else File(path)
let getFileSystemEntries path = Directory.GetFileSystemEntries path |> Array.to_list

let files path = 
    let rec inner fileSystemEntries files =
        match fileSystemEntries with
            | [] -> files
            | File path :: rest -> inner rest (path :: files)  
            | Directory path :: rest -> inner (List.append rest (getFileSystemEntries path)) files  
    inner (getFileSystemEntries path) []

let downloadFile path = 
    use streamReader = new StreamReader(File.OpenRead path)
    streamReader.ReadToEnd()    
    
let words input= Regex.Matches(input, "\w+") |> Seq.cast |> Seq.map (fun (x:Match) -> x.Value.ToLower())
 
let wordCount = files >> 
                List.map downloadFile >>
                List.map words >>
                List.fold (fun acc x -> Seq.append acc x) Seq.empty >> 
                Seq.groupBy (fun x -> x) >> 
                Seq.map (fun (value, sequence) -> (value, Seq.length sequence))

let writeTo (path:string) (values:seq<string * int>) = 
    use writer = new StreamWriter(path)
    values |> Seq.iter (fun (value,count) -> writer.WriteLine(value + " " + count.ToString()))  

let startTime = DateTime.Now
let count = wordCount "Z:\\20_newsgroups"

printfn "Writing counts in alphabetical order"
count |> Seq.sort |> writeTo "C:\\results\\counts-alphabetical-fsharp.txt"

printfn "Writing counts in descending order"
count |> Seq.sortBy (fun (_, count) -> count * -1) |> writeTo "C:\\results\\counts-descending-fsharp.txt"

let endTime = DateTime.Now
printfn "Finished in: %d seconds" (endTime - startTime).Seconds 

The problem is that this version results in a StackOverFlow exception when I try to execute it with all the newsgroup articles although it does work correctly if I select just one of the folders.

From what I can tell the exception happens on line 24 when I get the text out of each of the files and store it in the list.

I tried changing this bit of code so that instead of doing that I combined the ‘words’ and ‘downloadFile’ functions so that the whole string wouldn’t be saved but this doesn’t seem to help that much. The exception just ended up happening a bit further down.

I’m not sure if it’s possible to make this work by making use of lazy collections - I’m not that familiar with those yet so I”m not sure how to do it - or if this approach is just doomed!

Despite the fact it doesn’t quite work there were some interesting things I noticed while playing with this problem:

If anyone has any ideas about how I can get this not to blow up that would be cool!

These are some of the other solutions that I’ve come across:

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket