Lunch break coding - Word Break Problem

Simple programming puzzle to break up words according to dictionary.

Posted on August 15, 2011 haskell, puzzle, algorithm .

Another funny little programming puzzle that I found here describes the Word Break Problem.

Given an input string and a dictionary of words, segment the input string into a space-separated sequence of dictionary words if possible. For example, if the input string is “applepie” and dictionary contains a standard set of English words, then we would return the string “apple pie” as output.

So let’s give it a shot at using haskell:

First we need some imports and a convenient data type for the dictionary that we are fed:

import Data.List(inits)
import qualified Data.Set as S
import Maybe(isJust)
import Control.Applicative((<$>))
import System(getArgs)
type Dictionary = S.Set String

Next the formulation of the problem itself:

breakWords ::  String -> Dictionary -> Maybe [String]
breakWords xs dict = splittUp xs [] where
  splittUp :: String -> [String] -> Maybe [String]
  splittUp [] rs = Just $ reverse rs -- done, give back result
  splittUp xs rs = safeHead $ filter isJust solutions where
    solutions = [splittUp rest (match:rs) | (match,rest) <- findMatches xs dict]

And of course the helper functions we need for this to work:

findMatches :: String -> Dictionary -> [(String,String)]
findMatches xs dict =
  [(x,drop (length x) xs) | x <- reverse $ inits xs, x `S.member` dict]
safeHead xs = if null xs then Nothing else head xs

Now drop into ghci for a quick test to validate the implementation:

*Main> S.fromList . lines <$> readFile "/usr/share/dict/words" >>=
          print . breakWords "sometimesitisjustsosimple"
Just ["sometimes","it","is","just","so","simple"]

Seems to work alright!
This implementation will give us exactly one valid solution (trying first to find the largest possible matches).
If we are interested in all possibilities, this is just a small tweak:

breakWords2 xs dict = splittUp xs [] where
  splittUp [] rs = [Just $ reverse rs] -- done, give back result
  splittUp xs rs = filter isJust $ concat solutions where
    solutions = [splittUp rest (match:rs) | (match,rest) <- findMatches xs dict]

Full source-code is available here.


title-image by Sam Javanrouh (license)