HacPDX is Coming
September 19, 2009
That’s right – HacPDX is less than a week away so REGISTER if you haven’t. Failure to register means you might not have network access or even a chair!
I’ve planned to work on networking but I’ll be happy to work on anything that interests people and makes progress including networking, crypto, kernel modules, hackage-server, even ARM (assuming an expert attends).
I’ve also heard a number of people mention working on various C Bindings – which is awesome because I suck at marshalling so maybe they can help. Can’t wait to see everyone there and hacking away!
Explicit Parallelism via Thread Pools
March 13, 2009
I’m happy to say that this post is being finished up from my new home in Portland, Oregon! After many years of searching around I’ve finally decided to study at Portland State. Its probably optimisitic of me to think I’ll have more time as a graduate student, but I hope this is the beginning of much more involvement for me in the Haskell community.
Thread Pools from the Control-Engine Package
Control.Engine was recently released on hackage, providing a simple way to instantiate worker threads to split-up the processing of streaming data. Its was originally developed as a spin-off library from my DHT and I’ve since generalized it to cover numerous cases.
Trivial Thread Pools
The trivial module Control.ThreadPool can cover static examples such as a recent question asked on the haskell-cafe:
I have a function that does some IO (takes a file path, read the file, parse, and return some data), and I would like to parallelize it, so that multiple files can be parsed in parallel.
‘Control.ThreadPool’ gives us an easy answer (but not as slick as the map reduce answer the person was asking for on cafe). First we have our typical fluff.
import Control.ThreadPool (threadPoolIO)
import System.IO (openFile, IOMode(..))
import System.Environment (getArgs)
import Control.Concurrent.Chan
import Control.Monad (forM_)
import qualified Data.ByteString.Lazy.Char8 as L
main = do
as <- getArgs
As you can see below, we simply say how many threads we want in our thread pool and what action (or pure computation, using ‘threadPool’) we wish to perform. After that its just channels – send input in and read results out!
(input,output) <- threadPoolIO nrCPU op mapM_ (writeChan input) as -- input stream forM_ [1..length as] (\_ -> readChan output >>= print) where nrCPU = 4 op f = do h <- openFile f ReadMode c <- L.hGetContents h let !x = length . L.words $ c hClose h return (f,x)
And while this does nothing to demonstrate paralellism, it does work:
[tom@Mavlo Test]$ ghc -O2 parLines.hs --make -threaded -fforce-recomp
[1 of 1] Compiling Main ( web.hs, web.o )
Linking web ...
[tom@Mavlo Test]$ find ~/dev/Pastry -name *lhs | xargs ./parLines +RTS -N4 -RTS
("/home/tom/dev/Pastry/Network/Pastry/Module/Joiner.lhs",107)
("/home/tom/dev/Pastry/Network/Pastry/Config.lhs",14)
("/home/tom/dev/Pastry/Network/Pastry/Data/LeafSet.lhs",120)
("/home/tom/dev/Pastry/Network/Pastry/Data/Router.lhs",87)
("/home/tom/dev/Pastry/Network/Pastry/Data/RouteTable.lhs",75)
("/home/tom/dev/Pastry/Network/Pastry/Data/Address.lhs",152)
Control Engine Setup
The thread pools are simple, but what if you need more flexibility or power? What happens if you want to have an up-to-date state shared amoung the threads, or there’s a non-paralizable cheap computation you need to perform before the main operation? The answer is to use Control.Engine instead of Control.ThreadPool. The engine provides managed state, numerous hook location, and an abilty to inject information to mid-engine locations.

Injectors
The inject* calls can bypass the input hooks (injectPreMutator) or bypass everything besides the output hooks (injectPostMutator) – thus creating a ‘result’ that had no corrosponding ‘job’.
Hooks
Hooks are capable of modifying or filtering jobs or results. All hooks are of type state -> a -> IO (Maybe a); its important to note the type can not change and if a hook returns Nothing then the job or result stops there.
Hooks can either be input, pre-mutate, post-mutate, or output. Input and output hooks are ran in series on all jobs or results respectivly; this is intended for low computation tasks that shouldn’t be done in parallel later. Pre and post mutate hooks happen on the (parallel) worker threads before an after the main task, called the mutator.
Mutator
The engine consists of N mutator threads, which is the only operation capable of transforming the jobs into a different (result) type.
State Management
Control.Engine was built with the idea that jobs and state reads were frequent while alterations to the state were rare. A design decision was made to use STM to resolve all contention on state alterations and have a manager watch the TVar for change then bundle those changes in a quicker to read fashion (MVar) for the input, output, and worker threads.
The state provided to the hooks and mutator is always consistent but not guarenteed up-to-date. When modifications to the state occur a transactional variable is modified, which wakes the stateManager; in turn, the state manager updates the state MVar which is read by each thread before processing the next job. In the future IORefs might be used instead of the MVar – all contention is handled by the STM and the only writer for the MVar (future IORef) should be the State Manager.
Web Crawling
Now lets figure out how to help users who need more flexibility using Control.Engine instead of Control.ThreadPool.
MyCatVerbs, from #haskell, suggested a web crawler that uses URls as the job and the mutator (worker) can add all the links of the current page as new jobs while ignoring any URL that was already visited. Lets start!
The imports aren’t too surprising – tagsoup, concurrent, bloomfilter and Control-Engine are the packages I draw on.
module Main where import Control.Concurrent (forkIO, threadDelay) import Control.Concurrent.Chan import Control.Monad (forever, when) import Control.Engine -- Control-Engine import Control.Exception as X import Data.BloomFilter -- bloomfilter import Data.BloomFilter.Hash -- bloomfilter import Data.BloomFilter.Easy -- bloomfilter import Data.IORef import System.Environment (getArgs) import Text.HTML.Download -- tagsoup import Text.HTML.TagSoup -- tagsoup type URL = String data Job = GetURL URL | ParseHTML URL String deriving (Eq, Ord, Show) main = do (nrCPU:url:_) <- getArgs
The library tries to remain flexible which makes you do a little more work but don’t let that scare you! It needs an IO action to get tasks and an IO action that delivers the results. Most people will probably just want a channel, but sockets or files would do just as well.
input <- newChan output <- newChan
Starting the engine is a one line affair. You provide the number of threads, input, output, a mutator function and initial state. In return you are provided with an ‘Engine’ with which you can modify the hooks and use the injection points.
For this web crawler my ’state’ is just a bloom filter of all visited URLs so I’ll keep that in the one hook its needed and declare the engine-wide state as a null – (). For the chosen task the mutator needs a way to add more jobs (more URLs) so as pages are parsed any new URLs can be queued for future crawling; this is handled via partial application of mutator funciton.
eng <- initEngine (read nrCPU) (readChan input) (writeChan output) (mutator (writeChan input)) ()
As mentioned, I’ll use a bloom filter to keep the web crawler from re-visiting the same site many times. This should happen exactly once for each URL and is fairly fast so I’ll insert it as an ‘Input Hook’ which means a single thread will process all jobs before they get parsed out to the parallel thread pool.
let initialBF = fromListB (cheapHashes nrHashes) nrBits []
(nrBits, nrHashes) = suggestSizing 100000 (10 ** (-6))
bf <- newIORef (initialBF,0)
let bfHook = Hk (uniqueURL bf) 1 "Ensure URLs have not already been crawled"
addInputHook eng bfHook
Finishing up main, we print all results then provide an initial URL. Notice we run forever – there’s no clean shutdown in this toy example.
forkIO $ forever $ printResult output writeChan input (GetURL url) neverStop eng where neverStop eng = forever $ threadDelay maxBound
And from here on we’re just providing the worker routine that will run across all the threads and we’ll define the input hook. TagSoup performs all the hard work of downloading the page and parsing HTML. Just pull out the <a href=”…”> tags to add the new URLs as jobs before returning any results. In this example I decided to avoid any sort of error checking (ex: making sure this is an HTML document) and simply returning the number of words as a result.
mutator :: (Job -> IO ()) -> st -> Job -> IO (Maybe (URL,Int)) mutator addJob _ (GetURL url) = forkIO (do e <- X.try (openURL url) :: IO (Either X.SomeException String) case e of Right dat -> addJob (ParseHTML url dat) _ -> return () ) >> return Nothing mutator addJob _ (ParseHTML url dat) = do let !urls = getURLs dat !len = length urls fixed = map (\u -> if take 4 u /= "http" then url ++ '/' : u else u) urls mapM_ (addJob . GetURL) fixed return $ Just (url,len) where getURLs :: String -> [URL] getURLs s = let tags = parseTags s in map snd (concatMap hrefs tags) hrefs :: Tag -> [(String,String)] hrefs (TagOpen "a" xs) = filter ( (== "href") . fst) xs hrefs _ = [] printResult :: (Show result) => Chan result -> IO () printResult c = readChan c >>= print
Filtering out non-unique URLs is just the bloom filter in action.
uniqueURL :: IORef (Bloom URL, Int) -> st -> Job -> IO (Maybe Job)
uniqueURL _ _ j@(ParseHTML _ _) = return $ Just j
uniqueURL bf _ j@(GetURL url) = do
(b,i) <- readIORef bf
if elemB url b
then putStrLn ("Reject: " ++ url) >> return Nothing
else do writeIORef bf (insertB url b, i + 1)
when (i `rem` 100 == 0) (print i)
return $ Just j
Performance
P.S. No serious performance measurements have been made beyond extremely expensive (and trivially parallel) problems, so those don’t count.
14Mar2009 EDIT: Said something about Control.Engine before showing the diagram to make reading smoother.
EDIT2: ThreadPool example shown was a version using MD5, not length – oops! Fixed now.
hsXenCtrl and pureMD5
August 7, 2008
On vacation I found some time to upload the new hsXenCtrl library (0.0.7) and pureMD5 (0.2.4)
The new hsXenCtrl includes the System.Xen module, which is a WriterT ErrorT transformer stack and a brief attempt at ‘Haskellifying’ the xen control library. I find it much more useful for simple tasks like pausing, unpasing, creating and destroying domains. The API is still subject to change without notice as plenty of function are still very ‘C’ like (ex: scheduler / sedf functions).
pureMD5 received a much smaller change – some users noticed the -fvia-c caused compilation headaches on OS X. After removing the offending flag, some benchmarks revealed no measureable difference in speed, so this is an over-due change. OS X users rejoice!
Objections to Lenovo + SuSE
June 10, 2008
I just received my Thinkpad T61p and was eager to try SUSE, seeing as I have never used SUSE before but have used about every other major Linux distribution. Sadly it didn’t pan out:
1) Updating the software via zmd
Sat for over half an hour “resolving dependencies” without telling me anything else (buttons were grayed out). Sorry, but rule number one: never leave the user out of the loop.
2) Installing a basic -devel library (libpng, depends on zlib):
resolve-dependency: 50 sec
transact: 54 sec
update-status: 49 sec
And when I add in zmd updating the list of software, I am talking about a three or four minute process just to install trivial developer libraries such as libpng-devel. I’m sure this is just a configuration issue but it really gave me a sour taste.
3) Using Yast2
A) Common libraries and programs not in SUSEs repo
Many packages weren’t even found, for example ‘wine’ and ‘libjpeg-devel’ turned up zero search results. Perhaps there is some reason the GUI isn’t giving me much, but using yast through the cli doesn’t seem to be any comparison to apt or yum. Am I mistaken in thinking yast2 should help here?
B) Many options not entirely visible
“Installation into Dire” ok, so I know thats “Directory”, but what does it do? Can it help with any of my issues? Why can’t I mouse over it and read it all or see the entire text in a status bar?
4) Upgrading to 10.1 and 10.2
A) Three was no blatant ‘upgrade to 10.1′ button, but I found it and lived, no big deal
B) Upgrading to 10.2, short of downloading openSUSE DVDs, seems to require registration. This is amazingly stupid if so. Even if there is a way to do so without registering why have the road block method be the easiest one to stumble across (via yast2 -> online update)?
C) Registration requires a “Service Tag” which probably came with my laptop (or could it be that I am not entitled to this service?). Being a savvy computer user I know right where to look for service tag numbers… after trying numbers on the laptop, the CD, the CD sleeve, and some on the box I gave up (I couldn’t find any documentation that identified such a number, obviously).
What does all this mean to the average Troll Enthusiast Fanatic Open-source-developer-wanna-be Parasite (TEFOP)?
1) I’m too tired
2) I don’t know how to use SuSE (and don’t care enough to spend much time)
3) My Internet connection must be mush (nope, I’ve been getting 200KBps, all of my HTTP downloads were fine)
And what do I want you to take away?
1) Hardware vendors shouldn’t ship 3 year old OSes (10.0 came out 6Oct2005)
2) More importantly: Software vendors trying to improve on their brand name should not allow older versions of their product to continue shipping on new hardware!
3) Requiring registration is a bad idea, but if it must be so then tell customers where to get the information you are requesting of them.
4) Keep the user informed. I don’t like 4 minute install times for 400KB libraries. I literally found the homepage, downloaded, compiled and installed a program before my single package install finished in two cases.
Business Models and Open Source
April 6, 2008
I’ve always thought the ideal work would include a healthy dose of open source. To this end, I try to make my time wasting activities (slashdot, proggit) worth while by noting the various business models and their success. This is an informal brain dump with extra facts pulled in from Wikipedia – as such you should confirm any information before long term storage in your brain.
For idealistic reasons, the preferred model would be like RedHats. You compose, improve and build open source projects and products, providing them for free and selling the service (consulting, tailored development, support).
The main problem here is that, as a theoretical small business owner, I want to sell products and not services. In addition to the the discreet nature of products (as opposed to continuing duties of services), the constructive feel of a product oriented business seems nice in contrast to the droning of a service oriented business. Here is a question: is the concept of open source at odds with product driven revenue?
The Xen method seemed to be making an open source component and selling closed source support tools. This appeared aimed at preventing RedHat et al. from ’stealing’ the GPL code and owing nothing to XenSource. I say “seemed” because the true aim could have been to get bought out – as XenSource was on 22Oct2007 by Citrix for $500M.
A well known (and often successful) method of dual licensing was employed by TrollTech. TrollTech owned the source to QT which they monotized by selling proprietary licenses in addition to offering the source code under GPL. Like Citrix, on 28Jan2008 Nokia offered (and TrollTech accepted) a $163M buyout.
Like TrollTech, MySQL offered both support and dual licensing. I notice this is referred to as a ’second generation’ open source company on Wikipedia – not sure how common this term is, but I take issue with the idea that the distinguishing factor between second and first generation open source companies is their ability/willingness to sell closed source licenses. MySQL was purchased by SUN on 16Feb08 for $1B… are we seeing a trend yet?
The Jabberwolk
March 21, 2008
Welcome to the Jabberwolk. This is my blog recently moved from sequence.complete.org. Here I will try to stay on the topics of Haskell, Xen, and Linux but anything technical is game and I’ll warn you that politics might appear (but I’ll try to keep that down).
Edit: My old blog is here.