Thumbs down version

November 7th, 2011

It works

The last few weeks, I have been finishing the first version of the posTagger. And I’m thrilled to announce that it works. The posTagger currently:

  • Analyses a training file and builds a list with tokens and their most frequent tag.
  • Tags the tokens in the patching file. Unknown tokens are tagged as nouns (“NN”).
  • Builds a list of possible transformation rules to correct tags, based on the patching file.
  • Finds the most effective transformation rule for a given state, changes the state of the patching file using that rule and calculates the percentage of correct tags.
  • Repeats this last step until no more improvement is possible.

There are 14836 rules in the transformation list. It takes quite a while to find the most effective transformation rule every time and it would thus take days or weeks to run the application. This is countered by isolating the 100 and 1000 most frequent transformation rules respectively. We then calculate the most effective rule in the list and change state. This yields the following results (expressed in percentage of correctly tagged tokens in the patching file):

  • basic tagging with the frequency list: 87.66183 %
  • tagging unknown tokens as nouns: 88.5266 %
  • try the 100 transformation rules: 90.993805 % in 30 rules
  • try the 1000 transformation rules: 92.19936 % in 106 rules.

I stopped running the posTagger at state 106 because it took way too long, so there is still room for improvement there. The basic functionality of the tagger seems to work, and the following weeks I will try to find a solution to the following issues:

  1. What is an efficient number of transformation rules to start off with? A list of 100 rules runs fast but results in fewer corrections, while a list of 1000 rules runs slow but returns more corrections. What is the ideal number here? I need a better view on the results in each state to determine this.
  2. Are some rules applied twice? If not, maybe we can delete each rule we used from the list and let the tagger run faster as it progresses.
  3. I only implemented three patch templates of Brill’s original paper (‘tag before’, ‘tag after’, ‘tag before and tag after’). I should implement all of them.

One More Thing

Oh… Here’s Jigsaw Falling Into Place (Thumbs Down Version).
I started on my new job a few days ago. From Steve Jobs’ Stanford commencement speech:

“Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven’t found it yet, keep looking. Don’t settle. As with all matters of the heart, you’ll know when you find it. And, like any great relationship, it just gets better and better as the years roll on. So keep looking until you find it. Don’t settle.”

Backtrack to the horizon, you Road Runner!

October 4th, 2011

Do the backtrack

Whenever exploring the Canadian wild, you should sing to scare off bears, wolves, cougars and squirrels. You should also backtrack now and again, just to make sure that you are where you think you are.

Brill’s tagger is not exactly a prime example of Canadian scenery and I do not sing while programming, but I did backtrack my trail of the last few weeks. I wrote functions to export two csv files:

  1. the list of tokens and their most frequent tags, distilled from the training file (view)
  2. the list of the transformation rules and their frequency (view).

I just thought that they might come in useful as landmarks when debugging or exploring the horizon of natural language processing.

Event Horizon

We now have a list of all transformation rules, so let’s transform like crazy! Where is Optimus Prime?

Well… It’s not quite that simple. When applying a transformation rule, it will correct errors in the proposed tags. However, it will also cause a number of new errors. The most successful rule is thus the one with the greatest net improvement, calculated by subtracting the caused errors from the corrected ones.
After finding the most effective rule, we apply it (“change state”) and look for the best rule again. After which we change state, look for the best rule for that state, change state and… You get the idea.

This procedure looks simple, but with 14.836 rules it better be fast. Like Road Runner fast. And we should always check whether there still is a net improvement. We could limit the set to the 100 most common rules, but this blog is also a zen exercise in dealing with huge amounts of data. The 100 most common rules is for cowards. For the time being :-)

So let’s face the horizon and just start walking, shall we?

The beft n-grams. Ever.

September 21st, 2011

A few months ago, I was working on n-grams in the Brown Corpus and I was wondering about the speed of my Haskell application. And that seemed so important. And then some people from Harvard, MIT, the American Heritage Dictionary, the Encyclopedia Britannica, the Google got together and did this. 5 million books. 2 billion n-grams. And you can play with them at Google Labs. And you can download the datasets. Looking for data? This is a cookie jar the size of the moon.

And I, for one, welcome our new n-gram overlords.

Thanks for the link, KrisDS!

Va, vis, deviens

September 19th, 2011

Instead of writing code like crazy, I’m reading documentation on Brill’s tagger. The first interesting text is Eric Brill’s dissertation A Corpus-Based Approach to Language Learning, where he describes a method for transformation-based tagging. It seems that a few of my questions are being answered in his thesis, so I will work my way through it before posting back to you.

When God Created the Coffee Break

September 6th, 2011

I haven’t been posting a lot lately, but I have been working on the POS tagger. Trust me :-). The last few weeks, I put emphasis on rewriting the tagger built by Daniël de Kok and Harm Brouwer (www.nlpwp.org, based on an article by Eric Brill). I tried to understand the functions and capture them in my own words. In the process, I checked my results against theirs, such as the percentage of correct/empty/incorrect tags.

All results were identical, until today. There seems to be a slight problem with the nlpwp tagger. Daniël de Kok and Harm Brouwer roughly do the following:

  1. Create a model of the most frequent word/tag couples based on a tagged training file.
  2. Run the model on an untagged patch file.
  3. Tag all unknown words in the patch file with “NN”. Then check it against the tagged version of the file.

The logical next step would be to extract transformational rules from the patch file to “patch” the model, as has been done in the original article. Instead, they seem to run the model against the training file, tag unknown words with “NN” and look for transformation rules. This seems to be a bug and it is odd for two reasons.
Firstly, there is no reason to run the model against the data it has been trained on. You know that the model contains every word in the data because its contents was obtained from the data: you’re moving in a circle. As a consequence, there is no need to tag unknown words with “NN” because there are no unknown words (cf. nlpwp.org itself, under 7.3. Evaluation).
Secondly, extracting the transformation rules from this file will neglect the possibility of unknown words being tagged with “NN”. I suspect they will not work that well, which still has to be tested.

Instead, I think one should extract transformational rules from the patch file, get the most efficient ones, and then test the whole tagger on a third untagged file. I guess that the results from this test will give a good idea of the tagger’s strength. In fact, this has already been done by Eric Brill, the writer of the original article, who created the Brill tagger.

My Haskell version of the tagger will use three files (as described in the original article): a training file, a patch file and a test file, containing 90 %, 5 % and 5 % of the Brown Corpus, respectively.

I’ll keep you posted.

Travel is Dangerous

August 11th, 2011

I haven’t posted something for a while, but I have been programming and rewriting parts of the posTagger from www.nlpwp.org:

  • the tagger now works with three different files (training, testing, evaluation)
  • it is trained based on the frequency of tags
  • and adds the most common tag (“NN”).

Next step is the implementation of transformational rules, but I have to dive into the Data.List.Zipper package first. There is also a small difference in statistics from the nlpwp.org version that I have to check.

posTagger   
  1. import qualified Data.Map as M
  2. import qualified Data.List as L
  3. import Data.List.Zipper as Z
  4. import Data.Maybe as DM
  5. import System.Environment (getArgs)
  6. import System.IO
  7.  
  8. {-----------
  9. DATA TYPES
  10. -----------}
  11. type Token = String
  12. type Tag = String
  13. type TokenTag = (Token, Tag)
  14.  
  15. {----------------
  16. IO: dealing with the outside world
  17. -----------------}
  18. main = do
  19. (command:args) <- getArgs
  20. case command of
  21. "tag" -> tag args
  22. _ -> putStrLn "Error: command does not exist\nPossible commands are:\n-tag filename\n"
  23.  
  24. tag :: [String] -> IO()
  25. tag [trainName, testName, evalName] = do
  26. -- open files and retrieve contents
  27. trainFile <- openFile trainName ReadMode
  28. trainContents <- hGetContents trainFile
  29. testFile <- openFile testName ReadMode
  30. testContents <- hGetContents testFile
  31. evalFile <- openFile evalName ReadMode
  32. evalContents <- hGetContents evalFile
  33. let model = tokenMostFreqTag $ tokenTagFreqs $ map (toTokenTag '/') $ words trainContents
  34. -- test frequency tagging and print results
  35. putStrLn "***Frequency tagging***"
  36. let freqTags = freqTest model $ words testContents
  37. let evalTokenTag = map (toTokenTag '/') $ words evalContents
  38. let statFreqTags = compareMaybe evalTokenTag freqTags [0,0,0]
  39. prettyPrint statFreqTags
  40. -- test frequency tagging + base tagging and print results
  41. putStrLn "***Base tagging***"
  42. let baseTags = baseTest freqTags "NN"
  43. let statBaseTags = compareTag evalTokenTag baseTags [0,0]
  44. prettyPrint statBaseTags
  45. hClose testFile
  46. hClose evalFile
  47. test _ = do
  48. putStrLn "Error: the tag command requires three arguments: tag trainName testName evalName"
  49.  
  50. {-------------------
  51. MODEL TRAINER
  52. -------------------}
  53. -- split the tokens and the tags
  54. toTokenTag :: Char -> String -> TokenTag
  55. toTokenTag separator string =
  56. let (token, tag, _) = rsplit_ separator string
  57. in (token, tag)
  58.  
  59. rsplit_ :: (Eq a) => a -> [a] -> ([a], [a], Bool)
  60. rsplit_ separator = foldr (splitBool separator) ([],[],False)
  61. where
  62. splitBool separator letter (token, tag, True) = (letter:token, tag, True)
  63. splitBool separator letter (token, tag, False) | letter == separator = (token, tag, True)
  64. | otherwise = (token, letter:tag, False)
  65. -- Step 1: Frequency
  66. -- Calculate the frequency of tags for a token
  67. tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
  68. tokenTagFreqs = L.foldl' countWord M.empty
  69. where
  70. countWord map1 (token, tag) = M.insertWith (countTag tag) token (M.singleton tag 1) map1
  71. countTag tag _ map2 = M.insertWith (\ newFreq oldFreq -> oldFreq + newFreq) tag 1 map2
  72.  
  73. -- Find the most frequent tag for a token
  74. tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
  75. tokenMostFreqTag = M.map (fst . M.foldlWithKey findMax ("NIL", 0))
  76. where findMax acc@(_, maxFreq) tag freq
  77. | freq > maxFreq = (tag, freq)
  78. | otherwise = acc
  79.  
  80. -- Step 2: add default tag
  81. -- add default tag to all tokens without a tag (probably"NN")
  82. defaultTag :: Tag -> (Token, Maybe Tag) -> (Token, Tag)
  83. defaultTag baseTag (token, Nothing) = (token, baseTag)
  84. defaultTag baseTag (token, Just tag) = (token, tag)
  85.  
  86. {-------------
  87. MODEL TESTER
  88. --------------}
  89. lookupToken :: M.Map Token Tag -> Token -> Maybe Tag
  90. lookupToken myMap token = M.lookup token myMap
  91.  
  92. freqTest :: M.Map Token Tag -> [Token] -> [(Token, Maybe Tag)]
  93. freqTest freqModel wordList = L.zip wordList tagList
  94. where tagList = map (lookupToken freqModel) wordList
  95.  
  96. baseTest :: [(Token, Maybe Tag)] -> Tag -> [(Token, Tag)]
  97. baseTest tokenTag baseTag = map (defaultTag baseTag) tokenTag
  98.  
  99. {-------------
  100. MODEL EVALUATOR
  101. ---------------}
  102. compareTag :: [(Token, Tag)] -> [(Token, Tag)] -> [Int] -> [Int]
  103. compareTag [] [] [number, correct] = [number, correct]
  104. compareTag _ [] [number, correct] = [number, correct]
  105. compareTag [] _ [number, correct] = [number, correct]
  106. compareTag (x:xs) (y:ys) [number, correct] = compareTag xs ys [(number + 1), (correct + (eval x y))]
  107. where eval tagOriginal tagComputed | tagOriginal == tagComputed = 1
  108. | otherwise = 0
  109.  
  110. compareMaybe :: [(Token, Tag)] -> [(Token, Maybe Tag)] -> [Int] -> [Int]
  111. compareMaybe [] [] [number, correct, empty] = [number, correct, empty]
  112. compareMaybe _ [] [number, correct, empty] = [number, correct, empty]
  113. compareMaybe [] _ [number, correct, empty] = [number, correct, empty]
  114. compareMaybe (x:xs) (y:ys) [number, correct, empty] =
  115. compareMaybe xs ys [(number + 1), (correct + (evalJust x y)), (empty + (evalNothing x y))]
  116. where
  117. evalJust (_,tagOriginal) (_,tagComputed) | tagOriginal == (DM.fromMaybe "NIHIL" tagComputed) = 1
  118. | otherwise = 0
  119. evalNothing (_, tagOriginal) (_,tagComputed) | tagComputed == Nothing = 1
  120. | otherwise = 0
  121.  
  122. prettyPrint :: [Int] -> IO()
  123. prettyPrint [number, correct, empty] = do
  124. putStrLn ("Total tagged: " ++ (show $ (((number - empty) `myDiv` number) * 100)) ++ " %")
  125. putStrLn ("Total tagged correctly: " ++ (show $ ((correct `myDiv` number) * 100)) ++ " %")
  126. putStrLn ("Total empty: " ++ (show $ ((empty `myDiv` number) * 100)) ++ " %")
  127. putStrLn ("Total tagged known: " ++ (show $ ((correct `myDiv` (number-empty)) * 100)) ++ " %")
  128. prettyPrint [number, correct] = do
  129. putStrLn ("Total tagged correctly: " ++ (show $ ((correct `myDiv` number) * 100)) ++ " %")
  130.  
  131. myDiv :: Int -> Int -> Float
  132. myDiv n1 n2 = (fromIntegral n1) / (fromIntegral n2)

When you run this in the Terminal, it goes a little something like this:

koenroelandt$ ./posTagger tag train.txt test.txt eval.txt
***Frequency tagging***
Total tagged: 95.773186 %
Total tagged correctly: 87.66148 %
Total empty: 4.226818 %
Total tagged known: 91.530304 %
***Base tagging***
Total tagged correctly: 88.52606 %

The weight

July 5th, 2011

A short post again. I wrote functions that add the default “NN” tag to the model if the token does not contain a tag. More precisely: when the tag value is Nothing instead of Just a tag. I also copied some functions that should run the model against an untagged test file. The distinction between modeling and testing proves a bit artificial, I might change it in the future.

As an aside, the following days will be filled with live jazz music, so don’t be surprised if I don’t publish a post very soon.

import qualified Data.Map as M
import qualified Data.List as L
import Data.List.Zipper as Z
import Data.Maybe as DM
import System.Environment (getArgs)
import System.IO
 
{-----------
DATA TYPES
-----------}
type Token = String
type Tag = String
type TokenTag = (Token, Tag)
 
{-----------}
{-------------------
MODEL TRAINER
-------------------}
-- split the tokens and the tags
toTokenTag :: Char -> String -> TokenTag
toTokenTag separator string =
   let (token, tag, _) = rsplit_ separator string
   in (token, tag)
 
rsplit_ :: (Eq a) => a -> [a] -> ([a], [a], Bool)
rsplit_ separator = foldr (splitBool separator) ([],[],False)
   where
      splitBool separator letter (token, tag, True) = (letter:token, tag, True)
      splitBool separator letter (token, tag, False) | letter == separator = (token, tag, True)
                                                     | otherwise = (token, letter:tag, False)
-- Step 1: Frequency
--  Calculate the frequency of tags for a token
tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
tokenTagFreqs = L.foldl' countWord M.empty
   where
      countWord map1 (token, tag) = M.insertWith (countTag tag) token (M.singleton tag 1) map1
      countTag tag _ map2 = M.insertWith (\ newFreq oldFreq -> oldFreq + newFreq) tag 1 map2
 
-- Find the most frequent tag for a token
tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
tokenMostFreqTag = M.map (fst . M.foldlWithKey findMax ("NIL", 0))
   where findMax acc@(_, maxFreq) tag freq
           | freq > maxFreq = (tag, freq)
           | otherwise = acc
 
-- Step 2: add default tag
-- add default tag to all tokens without a tag (probably"NN")
defaultTag :: Tag -> (Token, Maybe Tag) -> (Token, Tag)
defaultTag baseTag (token, Nothing) = (token, baseTag)
defaultTag baseTag (token, Just tag) = (token, tag)
 
{-------------
MODEL TESTER
--------------}
lookupToken :: M.Map Token Tag -> Token -> Maybe Tag
lookupToken myMap token = M.lookup token myMap
 
freqTest :: [Token] -> M.Map Token Tag -> [(Token, Maybe Tag)]
freqTest wordList freqModel = L.zip wordList tagList
   where tagList = map (lookupToken freqModel) wordList
 
baseTest :: [(Token, Maybe Tag)] -> Tag -> [(Token, Tag)]
baseTest tokenTag baseTag = map (defaultTag baseTag) tokenTag

LittleByLittle

June 30th, 2011

I wrote the first functions for the POS-tagger, all of them are based on the work at www.nlpwp.org. The types are the main difference here: Harm Brouwer and Daniël de Kok create a new data type, while I just use type synonyms. That way, the TokenTag type is basically a tuple containing two Strings. It just seemed a bit more intuitive, but we’ll see. I also removed the rsplit function because it ended up doing exactly the same as the toTokenTag function.

You ‘ll find the code below, it seems to work in ghci (I only did a short test).

  1. import qualified Data.Map as M
  2. import qualified Data.List as L
  3. import Data.List.Zipper as Z
  4. import Data.Maybe as DM
  5. import System.Environment (getArgs)
  6. import System.IO
  7.  
  8. {-----------
  9. TYPES
  10. -----------}
  11. type Token = String
  12. type Tag = String
  13. type TokenTag = (Token, Tag)
  14.  
  15. {-------------------
  16. POS MODEL TRAINER
  17. -------------------}
  18. -- split the tokens and the tags
  19. toTokenTag :: Char -> String -> TokenTag
  20. toTokenTag separator string =
  21. let (token, tag, _) = rsplit_ separator string
  22. in (token, tag)
  23.  
  24. rsplit_ :: (Eq a) => a -> [a] -> ([a], [a], Bool)
  25. rsplit_ separator = foldr (splitBool separator) ([],[],False)
  26. where
  27. splitBool separator letter (token, tag, True) = (letter:token, tag, True)
  28. splitBool separator letter (token, tag, False) | letter == separator = (token, tag, True)
  29. | otherwise = (token, letter:tag, False)
  30.  
  31. -- Calculate the frequency of tags for a token
  32. tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
  33. tokenTagFreqs = L.foldl' countWord M.empty
  34. where
  35. countWord map1 (token, tag) = M.insertWith (countTag tag) token (M.singleton tag 1) map1
  36. countTag tag _ map2 = M.insertWith (\ newFreq oldFreq -> oldFreq + newFreq) tag 1 map2

The Plan

June 28th, 2011

My imitation of the POS-tagger at www.nlpwp.org was finished today and everything seems to work. Seems. The problem is that the nlpwp tagger was written for the sake of instruction and is not supposed to be a working application. The success of frequency-based tagging is evaluated, but the performance of the transformation-based tagging remains unclear, especially between consecutive correction rules.

The Plan is to write my own flavour of the nlpwp tagger with some small differences. I’m still just brainstorming, so bear with me:

  • use three different versions of the Brown Corpus
    1. a tagged file to build a model (trainFile)
    2. an untagged file to test the model upon (testFile)
    3. a tagged file to evaluate the tags generated using the model on the testFile (evalFile). This is the same as the testFile, except that it’s been tagged
  • maintain a clear difference between training, testing and evaluating the tagger in the Haskell code
  • try to use a data type consistently for the list of tokens/tags (instead of a list, Zipper, Map…)
  • check whether it is possible to use the tagger for several corpuses.

I’m not saying that my way will be better, but I think I have to try this to get a better understanding of how the tagger works and what its weak spots are. To be continued…

The Dragon Reborn

June 27th, 2011

It’s been a while since I wrote a post for this weblog, and there are some (good) reasons:

  • I’ve been very busy, both at work and at home.
  • I relapsed into an old habit: reading The Wheel of Time. Parents, beware! Do not let your children read this stuff!
  • I took a stroll into a history of computational linguistics in the Low Countries by Leonoor van der Beek. It’s worth a read if you know Dutch and are looking for an interesting overview of the major breakthroughs and hurdles in this young discipline.
  • I’m also looking into formal semantics, although this project is quite premature.
  • I got tired of posting code (and making sure it looked alright). I installed a plugin last week that should make life a bit easier.

Yes, my attention is shifting. I already made clear from the start that the goal of this blog was unclear, so I’m just wandering about to see which subjects fascinate me. I also worked on the POS tagger from www.nlpwp.org and managed to display the 10 most frequent correction rules for the Brown Corpus. Next, we will use these rules on the corpus. I will post the code once I get an idea of the tagger’s success rate with these rules applied to it.