pipes-parse-3.0.0
pipes-parse-3.0.0 introduces a new lens-based parsing mechanism. These lenses improve the library in two ways:
- They greatly simplify the API
- They correctly propagate leftovers further upstream
Overview
The new parsing API consists of three main types, which roughly parallel the three pipes abstractions:

- Producers, which are unchanged from pipes:

      Producer a m x

- Parsers, which are the parsing analog of Consumers:

      type Parser a m r = forall x . StateT (Producer a m x) m r

- Lens'es between Producers, which are the parsing analog of Pipes:

      Lens' (Producer a m x) (Producer b m y)
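Because a Parser is nothing more than StateT over a Producer, you can build parsers with ordinary do-notation. Here is a small sketch of my own (drawTwo is a hypothetical name; draw is the real pipes-parse primitive, which returns Nothing at end of input):

import Pipes.Parse

-- Draw two consecutive elements, if available
drawTwo :: Monad m => Parser a m (Maybe a, Maybe a)
drawTwo = do
    ma <- draw
    mb <- draw
    return (ma, mb)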
What's neat is that pipes-parse does not need to provide any operators to connect these three abstractions. All of the tools you need already exist in transformers and lens (or lens-family-core, if you prefer a simpler alternative to lens).
For example, you connect a Parser to a Producer using either runStateT, evalStateT, or execStateT:
                                            +--- Result
                                            |
runStateT                                   v
    :: Parser a m r -> Producer a m x -> m (r, Producer a m x)

evalStateT
    :: Parser a m r -> Producer a m x -> m  r

execStateT
    :: Parser a m r -> Producer a m x -> m (   Producer a m x)
       ^^^^^^^^^^^^    ^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^
          Parser           Input                  Leftovers
These correspond to the three possible ways you might want to run a parser:
- runStateT: Return the result and leftovers
- evalStateT: Return only the result
- execStateT: Return only the leftovers
In fact, two of these functions closely parallel conduit operators:

- runStateT parallels ($$+)
- evalStateT parallels ($$)
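To make the three runners concrete, here is a quick GHCi sketch of my own (each comes from pipes; draw and drawAll come from pipes-parse):

>>> (r, rest) <- runStateT draw (each [1, 2, 3])
>>> r
Just 1
>>> evalStateT draw rest  -- The leftovers resume where the first draw stopped
Just 2
>>> execStateT draw (each [1, 2, 3]) >>= evalStateT drawAll
[2,3]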
You can also connect Producers to Lens'es using (^.) or view:

(^.) :: Producer a m x
     -> Lens' (Producer a m x) (Producer b m y)
     -> Producer b m y

(^.) parallels conduit's ($=) operator.
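For example, viewing an infinite Producer through the splitAt lens (covered in detail below) gives back its first few elements as a new Producer. A sketch of my own, assuming Prelude's splitAt is hidden as in the worked example later in this post:

-- The first three elements; the inner return value carries the rest of the stream
firstThree :: Monad m => Producer Int m (Producer Int m ())
firstThree = each [1..] ^. splitAt 3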
If you want to connect Lens'es to Parsers, you use zoom:

zoom :: Lens' (Producer a m x) (Producer b m y)
     -> Parser b m r
     -> Parser a m r

zoom parallels conduit's (=$) operator.
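For instance, zooming drawAll through splitAt delimits it to the first few elements of the input (a GHCi sketch of my own):

>>> evalStateT (zoom (splitAt 3) drawAll) (each [1..10])
[1,2,3]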
Finally, you connect Lens'es to each other using (.) (i.e. function composition):

(.) :: Lens' (Producer a m x) (Producer b m y)
    -> Lens' (Producer b m y) (Producer c m z)
    -> Lens' (Producer a m x) (Producer c m z)

(.) parallels conduit's (=$=) operator.
Here's a worked example showing off the new API, including the newly-added foldl integration:
import qualified Control.Foldl as L
import Lens.Family ((^.))
import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Prelude hiding (span, splitAt)

parser :: Parser Int IO ()
parser = do
    -- Attempt to draw a single element
    a <- draw
    lift (print a)

    -- Sum the next 10 input elements
    b <- zoom (splitAt 10) (L.purely foldAll L.sum)
    lift (print b)

    -- Collect all elements less than 15 into a list
    c <- zoom (span (< 15)) drawAll
    lift (print c)

    -- You can nest `zoom`s ...
    zoom (span (< 20)) $ do
        d <- zoom (splitAt 3) (L.purely foldAll L.product)
        lift (print d)

        e <- peek
        lift (print e)

    -- ... or compose lenses:
    f <- zoom (span (< 20) . splitAt 3) drawAll
    lift (print f)

    -- Print the remaining elements
    g <- drawAll
    lift (print g)

-- Lenses can modify `Producer`s, too
producer :: Monad m => Producer Int m (Producer Int m ())
producer = each [1..] ^. span (< 25)
We get the following output if we connect our parser to our producer:
>>> evalStateT parser producer
Just 1
65
[12,13,14]
4080
Just 18
[18,19]
[20,21,22,23,24]
>>>
Leftovers
There is a subtle difference between pipes-parse and conduit: pipes-parse correctly propagates leftovers further upstream whereas conduit does not. To illustrate this, let's begin from the following implementation of peek for conduit:
import Control.Monad.Trans.Class (lift)
import Data.Conduit
import Data.Conduit.List (isolate, sourceList)
peek :: Monad m => Sink a m (Maybe a)
peek = do
    ma <- await
    case ma of
        Nothing -> return ()
        Just a  -> leftover a
    return ma
peek attempts to draw a value using await and then undraws the value using leftover before returning the result. peek will work correctly when used like this:
source :: Monad m => Source m Int
source = sourceList [1, 2]

sink1 :: Show a => Sink a IO ()
sink1 = do
    ma1 <- peek
    ma2 <- peek
    lift $ print (ma1, ma2)
If we feed source to sink1, both peeks will return Just 1, since the first peek undraws the 1 to make it available for the second peek:
>>> source $$ sink1
(Just 1,Just 1)
But what happens if we delimit the first peek using isolate, which only allows a fixed number of values to flow through? This is a common parsing request: run a parser within a subset of the input.
sink2 :: Show a => Sink a IO ()
sink2 = do
    ma1 <- isolate 10 =$ peek
    ma2 <- peek
    lift $ print (ma1, ma2)
However, when you compose two conduits, the downstream conduit discards all leftovers when done. Michael is up front about this in the documentation for (=$):

"Leftover data returned from the Sink will be discarded."

There are similar warnings for ($=) and (=$=), all of which discard leftovers from the right component. We can run sink2 to trigger this behavior:
>>> source $$ sink2
(Just 1,Just 2)
The undrawn 1 is irreversibly lost when (=$) completes, which is why the second peek reads a 2 instead of a 1.
The analogous pipes code gets this right:
import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Prelude hiding (splitAt)
parser :: Show a => Parser a IO ()
parser = do
    ma1 <- zoom (splitAt 10) peek
    ma2 <- peek
    lift $ print (ma1, ma2)

producer :: Monad m => Producer Int m ()
producer = each [1, 2]
The pipes-parse version correctly restores the undrawn 1 so that the second peek also draws a 1:
>>> evalStateT parser producer
(Just 1,Just 1)
The magic is in the splitAt function, which is the pipes-parse analog of conduit's isolate. Compare the source of isolate (slightly rewritten):
isolate :: Monad m => Int -> Conduit a m a
isolate = loop
  where
    loop 0 = return ()
    loop n = do
        ma <- await
        case ma of
            Nothing -> return ()
            Just a  -> do
                yield a
                loop (n - 1)
... to the source of splitAt (also slightly changed to resemble isolate):
splitAt
    :: Monad m
    => Int
    -> Lens' (Producer a m x) (Producer a m (Producer a m x))
splitAt n0 k p0 = fmap join (k (loop n0 p0))
  where
    loop 0 p = return p
    loop n p = do
        x <- lift (next p)
        case x of
            Left   r      -> return (return r)
            Right (a, p') -> do
                yield a
                loop (n - 1) p'
The internal loop function in splitAt corresponds to the internal loop from isolate. The extra magic lies within the first line:
splitAt n0 k p0 = fmap join (k (loop n0 p0))
Lens aficionados will recognize this as the dependency-free version of:
splitAt n0 = iso (loop n0) join
This not only instructs the Lens' in how to isolate the first n elements (using loop), but also in how to reverse the process and merge the remaining elements back in (using join). This latter information is what makes leftover propagation work.
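For comparison, here is how a span-like lens can be written in the same style. This is my own sketch, not the library source, so the real pipes-parse span may differ in details; note how the first failing element is pushed back onto the remainder so that leftovers stay intact:

span
    :: Monad m
    => (a -> Bool)
    -> Lens' (Producer a m x) (Producer a m (Producer a m x))
span predicate k p0 = fmap join (k (loop p0))
  where
    loop p = do
        x <- lift (next p)
        case x of
            Left   r      -> return (return r)
            Right (a, p') ->
                if predicate a
                    then do
                        -- Element satisfies the predicate: pass it through
                        yield a
                        loop p'
                    -- First failing element: return it to the leftovers
                    else return (yield a >> p')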
Getters
Note that not all transformations are reversible, and irreversible transformations cannot propagate leftovers upstream. Fortunately, the lens model handles this perfectly: just define a Getter instead of a Lens' for transformations that are not reversible:
getter :: Getter (Producer a m x) (Producer b m y)
A Getter will only type-check as an argument to (^.) and not zoom, so you cannot propagate unDraw through a Getter like you can with a Lens':
zoom getter unDraw -- Type error
However, you can still use the Getter to transform Producers:
producer ^. getter -- Type checks!
This provides a type-safe way to distinguish transformations that can propagate leftovers from those that cannot.
In practice, pipes-based parsing libraries just provide functions between Producers instead of Getters for simplicity:
getter :: Producer a m x -> Producer b m y
... but you can keep using lens-like syntax if you promote them to Getters using the to function from lens:
to :: (a -> b) -> Getter a b
producer ^. to getter
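For example, mapping a function over a stream is not reversible in general, so it is best exposed as a plain function between Producers. A sketch of my own (shown is a hypothetical name; Pipes.Prelude.map is the real pipes function):

import qualified Pipes.Prelude as P

-- An irreversible transformation: there is no general way to undraw a
-- String back into the original Int
shown :: Monad m => Producer Int m r -> Producer String m r
shown p = p >-> P.map show

-- ... which you can then use as: producer ^. to shown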
Lens Support
Some people may worry about the cost of using lens in conjunction with pipes-parse, because lens is not beginner-friendly and has a large dependency graph, so I'd like to take a moment to advertise Russell O'Connor's lens-family-core library. lens-family-core is a beginner-friendly alternative that is (mostly) lens-compatible: it provides much simpler types and has a really tiny dependency profile.
Note that pipes-parse does not depend on either lens library. This is one of the beautiful aspects of lenses: you can write a lens-compatible library using nothing more than stock components from the Prelude and the transformers library. You can therefore combine pipes-parse with either lens-family-core or lens without any conflicts. This provides a smooth transition path from the beginner-friendly lens-family-core library to the expert-friendly lens library.
Laws
The lens approach is nice because you get many laws for free. For example, you get several associativity laws, like the fact that (^.) associates with (.):
(producer ^. lens1) ^. lens2 = producer ^. (lens1 . lens2)
Similarly, (.) associates with zoom:
zoom lens1 (zoom lens2 parser) = zoom (lens1 . lens2) parser
... and the trio of evalStateT/(^.)/zoom all associate:
evalStateT parser (producer ^. lens)
= evalStateT (zoom lens parser) producer
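You can spot-check this law on a small input (a GHCi sketch of my own, reusing splitAt and drawAll from the worked example):

>>> evalStateT drawAll (each [1..5] ^. splitAt 2)
[1,2]
>>> evalStateT (zoom (splitAt 2) drawAll) (each [1..5])
[1,2]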
Also, lens composition associates, because it's just function composition:
(lens1 . lens2) . lens3 = lens1 . (lens2 . lens3)
You even get the following identity laws for free:

producer ^. id = producer

zoom id parser = parser

f . id = f

id . f = f
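... and a quick GHCi sketch of my own confirms the first two identity laws on a small input:

>>> evalStateT drawAll (each [1,2,3] ^. id)
[1,2,3]
>>> evalStateT (zoom id drawAll) (each [1,2,3])
[1,2,3]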
However, there is one caveat: many of the lenses in pipes-parse do not satisfy certain lens laws. Specifically, they do not satisfy these laws:

-- Law violation #1
view lens (set lens x a) /= x

-- Law violation #2
zoom lens $ do x <- m        /=  do x <- zoom lens m
               f x                  zoom lens (f x)
Law violation #1 arises because I don't know of a lens-like abstraction that type-checks as a Getter and a Focusing, but not a Setter.
However, law #2 directly conflicts with a core pipes-parse feature, specifically lenses like splitAt that delimit parsers. Here's why:

zoom (splitAt n) $ do x <- m        /=  do x <- zoom (splitAt n) m
                      f x                  zoom (splitAt n) (f x)
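A GHCi sketch of my own makes the violation concrete (using draw from pipes-parse):

>>> evalStateT (zoom (splitAt 1) (draw >> draw)) (each [1..])
Nothing
>>> evalStateT (zoom (splitAt 1) draw >> zoom (splitAt 1) draw) (each [1..])
Just 2

The single zoom runs both draws inside one 1-element window, so the second draw fails, whereas the two separate zooms each get a fresh 1-element window.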
Limiting one parser to n elements is not the same as limiting its two sub-parsers to n elements each. So if you use pipes-parse lenses you cannot rely on zoom being a monad morphism when doing equational reasoning. This was a tough call for me to make, but I felt that delimiting parsers were more important than the monad morphism laws for zoom. Perhaps there is a more elegant solution that I missed that resolves this conflict, but I'm still pleased with the current solution.
Generality
Notice how all of these functions and laws are completely pipes-independent. Any streaming abstraction that has some Producer-like type can implement lens-based parsing, too, and also get all of these laws for free, including pure streams (such as strict or lazy Text) or even lazy IO. Other streaming libraries can therefore benefit from this exact same trick.
Batteries included
Downstream libraries have been upgraded to use the pipes-parse API, including pipes-bytestring, pipes-binary, and pipes-attoparsec. This means that right now you can do cool things like:
import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Pipes.Binary
import qualified Pipes.ByteString as ByteString
parser :: Parser ByteString IO ()
parser = zoom (ByteString.splitAt 100 . decoded) $ do
    x <- draw  -- Draw a decoded Int
    lift $ print (x :: Maybe Int)
    unDraw 99  -- This undraws the encoded bytes

producer :: Monad m => Producer ByteString m ()
producer = for (each [(1::Int)..]) encode
The above parser composes two lenses so that it zooms in on a stream of decoded Ints that consumes no more than 100 bytes. This transmutes draw so that it now receives decoded elements, and unDraw magically re-encodes elements when pushing back leftovers:
>>> producer' <- execStateT parser producer
Just 1
>>> evalStateT parser producer'
Just 99
Also, Michael Thompson has released a draft of pipes-text on Hackage. This means you can parse a byte stream through a UTF-8 lens, and any undrawn input will be encoded back into the original byte stream as bytes. Here is an example program showcasing this neat feature:
-- decode.hs
import Data.ByteString (ByteString)
import Data.Text (Text)
import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import qualified Pipes.ByteString as ByteString
import qualified Pipes.Text as Text
-- Retrieve all `Text` chunks up to 10 characters
parser :: Monad m => Parser ByteString m [Text]
parser = zoom (Text.decodeUtf8 . Text.splitAt 10) drawAll
main = do
    (textChunks, leftovers) <- runStateT parser ByteString.stdin
    print textChunks

    -- Now print the remaining `ByteString` chunks
    byteChunks <- evalStateT drawAll leftovers
    print byteChunks
The unused bytes from the decoded stream get correctly undrawn to the original byte stream!
$ ./decode
Hello, 世界!!!<Enter>
["Hello, \19990\30028!"]
abcdefg<Enter>
<Ctrl-D>
["!!\n","abcdefg\n"]
$
The remainder of the first line is undrawn by the decoder and restored as bytes in the original encoding.

Note that the above example is line-buffered, which is why the program does not output the Text chunks immediately after the 10th input character. However, if you disable line buffering then every chunk contains just a single character, and the example wouldn't illustrate how leftovers work.
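If you want to experiment with the unbuffered behavior anyway, one way is to disable line buffering via the standard System.IO API. A sketch of my own of a modified main for decode.hs:

import System.IO (BufferMode(NoBuffering), hSetBuffering, stdin)

main = do
    hSetBuffering stdin NoBuffering
    (textChunks, leftovers) <- runStateT parser ByteString.stdin
    print textChunks
    byteChunks <- evalStateT drawAll leftovers
    print byteChunks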
The above example could have also been written as a single Parser:
parser :: Parser ByteString IO ()
parser = do
    texts <- zoom (Text.decodeUtf8 . Text.splitAt 10) drawAll
    lift (print texts)

    bytes <- drawAll
    lift (print bytes)

main = evalStateT parser ByteString.stdin
... but I wanted to make the leftover passing explicit to emphasize that the leftover behavior holds correctly whether or not you exit and re-enter pipes.
Conclusion
The pipes-parse API lets you propagate leftovers upstream, encode leftover support in the type system, and reason equationally about your code using several theoretical laws. Additionally, pipes-parse reuses existing functions and concepts from lens and StateT rather than introducing a new set of abstractions to learn.
Note that pipes-parse used to have some FreeT-based operations as well. These have been moved to a separate pipes-group library (and upgraded to use lenses) since they are conceptually orthogonal to parsing. I will blog about this library in a separate post.
You can find pipes-parse on Hackage or GitHub, and it comes with an extended tutorial.