Wednesday, June 29, 2005

YAPC 2005 Day 2

Day 2

As before, these are my raw notes from the talks attended on day 2 of the YAPC conference.

Apocalypse Now - Perl 6 is Here Today!

Part I

Autrijus Tang attempted to comfort the crowd by pointing out that the transition from Perl 4 to Perl 5 did manage to complete successfully. And by analogy, if that transition took years, we shouldn't lose hope for this one.

Reviewing the features of Perl 5, Autrijus mentioned modules, closures, references and CPAN. As of today, CPAN has more than 8,000 modules by more than 2,500 authors. It also offers a tool for automatic download, testing and installation. CPAN is the greatest thing ever to happen to Perl. However, Perl is not necessarily the greatest thing to happen to CPAN: the Perl 5 syntax imposes a tax on CPAN users.

For example, sigils and references don't really behave consistently:

        $s              @a              %h
        $$s             $a[0]           $h{'!'}
        $s->foo()       :-(             :-(

One explains to a newbie that the '$' sigil indicates a singleton, whereas '@' indicates a multiplicity, and '%' indicates a multiplicity tagged with names. But when dereferencing a reference, the '$' sigil doesn't indicate anything about cardinality.
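To make the complaint concrete, here's a small Perl 5 sketch of my own (not from the slides) showing the shifting sigils:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @a    = (1, 2, 3);
my $aref = \@a;          # take a reference to the array

print $a[0], "\n";       # plain access: @ becomes $ for a single element
print $aref->[0], "\n";  # deref with an arrow...
print ${$aref}[0], "\n"; # ...or with a '$' sigil that says nothing
                         # about how many values @a actually holds
```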

As another example, there is considerable forced redundancy. The line "my $self = shift" may be as much as 10% of your code, if you write nice short subroutines like you should. The book "Higher-Order Perl" offers all sorts of fancy technology, but about 50% of it is nothing but bookkeeping.
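The boilerplate he's referring to looks like this in any ordinary Perl 5 class (my own minimal example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Counter;

sub new  { my $class = shift; return bless { n => 0 }, $class }
sub bump { my $self  = shift; $self->{n}++ }        # "my $self = shift" again...
sub n    { my $self  = shift; return $self->{n} }   # ...and again, in every method.

package main;

my $c = Counter->new;
$c->bump;
print $c->n, "\n";   # 1
```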

We're used to this "language tax", because the language is useful enough to make us tolerate the cost. But in Perl 6, the tax is not necessary. In the case of sigils above, the situation resolves to:

        $s                @a                %h
        $$s                
        <<crap, he switched slides too fast>>

Another savings pertains to "perl" the runtime engine, as opposed to "perl" the language. XS modules in Perl 5 are difficult to write and to maintain. Backward compatibility prevents us from cleaning up that situation. One engineer likened this situation to a game of Jenga: we have a convoluted pile of sticks, and we enhance it by piling more on top while creating holes lower down. All the guts of Perl end up serving multiple purposes, because we're forced to keep drawing sticks from elsewhere in the pile.

Inventive CPAN authors began creating Perl 5 dialects, using source filters created with tools like Filter::Simple to build a "preprocessor" stage. These dialects are mutually incompatible. Source filters are guaranteed not to compose, because each expects plain Perl 5 as input but receives some other filter's strange dialect instead.
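A source filter of the kind he means is easy to write with Filter::Simple. Here's a hypothetical one-keyword dialect (the module name is invented for illustration):

```perl
package Dialect::Whilst;    # hypothetical dialect module
use Filter::Simple;

# Rewrite 'whilst' into 'while' in any code that says "use Dialect::Whilst".
# The FILTER block sees the raw source text before perl parses it -- which
# is exactly why two such filters can't safely see each other's output.
FILTER { s/\bwhilst\b/while/g };

1;
```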

Another problem in Perl 5 is that the regex syntax imposes an arbitrary limit on what can be done. Similarly with operators and other aspects of the grammar. The Perl regex engine is not re-entrant, so modules built on it are guaranteed to be fragile.

OO and functional code unfairly handicaps golfers. Golfers always converge on line-based text processing with y// and s// as their two main operators, because the syntax becomes too verbose elsewhere in the language. Unfortunately, that affects regular programmers, not just golfers, because we all take the path of least resistance.


Fast Forward to Perl 6

Perl 6 is faster: soft, incrementally typed, with program-wide optimizations. Stronger types can be retro-fitted to a program. It's also friendlier: you can call Perl 5, C, Tcl or Python libraries directly. It's also easier: common usage patterns are dramatically shortened. You can even read a one-liner with comprehension! It's stronger, because it includes support for OO, functional and data-driven styles. It's leaner because it offers sane semantics instead of endless workarounds (that have become entrenched idioms, of course).

Autrijus then demonstrated a perl 6 script for uploading modules to CPAN. It was written to be interpreted using pugs, and works around some perl6 weaknesses by evaling code in a perl5 context. You can run all your perl5 code under perl6 by writing a perl6 script that pulls it in and runs it. But before you laugh, remember that the early perl5 scripts still used perl4 libraries, because there wasn't a body of perl5 libraries available. The first perl6 scripts will behave similarly.

Today the pugs interpreter cheats by linking in libperl, so of course it works. As an aside, that means it is subject to perl5 garbage collection bugs as well.

Going the other way, you can embed perl6 code in perl5 scripts by invoking the "use pugs" pragma in your perl5 code. You can turn off the perl6 behavior with the "no pugs" pragma. That works today (but not very well). Since it is built on Inline::Pugs and source filters, it won't play nicely with other source filters. It also invokes a pugs process and shuffles code back and forth to it, so you can't use it with things like DBI. Don't use it in production today. Folks are trying to fix it up so it works more reliably--stay tuned for the second part of the talk. You can get the pugs stuff at http://www.pugscode.org/.
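As I understood the demo, the embedding looked roughly like this (a sketch from memory; don't trust the details):

```perl
use pugs;     # everything below here is shipped off to the pugs process

sub double (Int $n) { return $n * 2 }   # perl6 syntax: a typed parameter

no pugs;      # back to plain perl5

print double(21), "\n";
```

Note that because the code is round-tripped through a separate pugs process, anything holding process-local state (database handles, file locks) won't survive the trip--hence the DBI warning above.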

In response to a question, Autrijus replied that he has no idea when the perl6 camel book will come out. There are two books on the market already--the only problem being that 50% of the material is outdated, and you don't know which 50%. The questioner seemed a bit frustrated and kept demanding some hint when perl6 will be production quality. Autrijus turned the question around beautifully by suggesting that the questioner could do a great deal to hasten the day.

Pugs was started on Feb 1, so it's only five months old. It started as an exercise based on "Types and Programming Languages", which is another recommendable book, along with "Higher-Order Perl". The exercise at the end of the book said, "Implement a small language that you know of." (There was general laughter at this point.) The first cut took only two days, because it was written in Haskell. Haskell is optimized for compiler writing. As an aside, Autrijus commented that Perl 5 closures leak memory horribly. Switching to lazy references resulted in random seg-faults. These problems are what drove him to use Haskell.

Autrijus summarized the progress to date, but it was a list of test scripts and such, so I didn't write it all down. Recently, someone created a pugs-based IRC bot called evalbot, so IRC users can evaluate perl6 commands interactively. The most popular command seems to be 'system("rm -rf /")', but "we aren't completely mad, so that doesn't work." Haskell makes it easy to identify unsafe commands, so a safe mode was created that prevents users from doing things like "print" or "system".

Pugs has over 100 committers, mostly because Autrijus hands out committer permissions to passers-by. There have been over 5,000 revisions committed. There seems to be a positive feedback cycle: when someone commits something, they seem to become committed to the project. Autrijus's own contributions are falling off over time--currently, below 40% of commits. His main function is maintaining a pugs journal and praising contributors.

The "State of the Carrot" speech claimed that pugs covers 80% of Perl 6 semantics, but that's wishful thinking. To a first-order approximation, some 30% of the semantics is covered. There are 7,000 automated tests today, with another 15,000 anticipated.


Part II

If you're interested in contributing to pugs, the first rule is: to report a bug, write a test. Test.pm is really simple to use, if Autrijus does say so himself. A pretty little graph created by Test::TAP::HTMLMatrix shows passed tests (green), failed tests (red), todos (dark green), etc. One cute aspect of this tool is that, if tests == bug reports, and tests are added to the end of a test script, then red marks at the end of the graph generally represent new bugs.

One attendee asked about the problem of submitted tests that are wrong, or that don't test what they purport to test. Autrijus replied that it isn't common, but it isn't rare either--perhaps 10% of submitted tests have some kind of problem. In the course of maintaining his journal, Autrijus reviews every patch. Others review the patches that touch their own work.

Current efforts in pugs concern OO. Introspection and roles are not underway yet. Once the OO features are working, work will focus on rules. Closures don't work yet.

Some benefits of Perl 6:

        * Simplicity
        * Clarity--different constructs look different, such as string eval and block eval (the latter being renamed "try").
        * Sub names are clearer--for example, several clear names replace multi-purpose "length"
        * Brevity: the dot notation saves keystrokes over the arrow notation
        * Implicit parameters are removed (such as $a and $b in sorts)
        * Operator overloading
        * Method overloading via type-based dispatch
        * Some new operators like "yada yada yada" and "defined or" ("$arg //= 3" means "$arg = 3 unless defined $arg")
        * Operator chaining (using transitivity)
        * "Hyper" operators that apply to an entire array
        * Reduction operators that collapse a list by means of an operation
        * Junctions are useful for situations in which complex conditionals are used
        * Type globs are history
        * Smart match (~~) replaces (=~) but has extra coolness like "$a ~~ any(<1, 2, 3>)" (not sure of exact syntax)
        * Exception handling is unified under $!
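A few of those features in Perl 6 syntax as presented (the language was still in flux in 2005, so treat this as a sketch, not gospel):

```perl
$arg //= 3;                   # defined-or: assign 3 only if $arg is undefined

if 0 <= $x <= 100 { ... }     # chained comparison, no '&&' needed

my @sums = @a >>+<< @b;       # hyper operator: elementwise addition

my $total = [+] @nums;        # reduction operator: collapse the list with '+'

if $n == any(1, 2, 3) { ... } # junction instead of a three-way conditional
```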

Some people are distressed that dot as the string concatenator is replaced with the tilde. But don't worry: interpolation is so darn good that the operator is practically obsolete. One attendee complained because he uses ".=" all the time.

Perl 6 has switch statements! They're coded using "given / when".
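The switch looks like this (again, 2005-era syntax, and the variable name is my own):

```perl
given $command {
    when 'start' { say "starting" }
    when 'stop'  { say "stopping" }
    default      { say "no idea what '$command' means" }
}
```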

Problems with Pugs:

        * Can't emit compile-time errors
        * No support for warnings or strictness
        * Slow -- dispatch is 100x slower than perl5
        * <<missed the last one>>

There are also parsing problems due to the ambiguity of Perl and the difficulty of optimizing parrot assembly, so PIL, the Perl Intermediate Language, was invented. PIL has far fewer node types (namely 16 at the moment). The result is a sort of "diet perl 6": perl 6-like, with no syntactic sugar left. PIL is still under development, so it isn't finalized, but it is stable. Other languages may or may not be translatable into PIL.

Note that in the "real" perl 6, the compiler may well not be self-hosting. Among other reasons, proponents of other languages may not like having a compiler written in Perl. So Perl/Parrot is not the same as C#/.Net. Pugs is entirely different. It wants to become self-hosting, because it's cool to be self-hosting, and it's also desirable to run perl 6 on the perl 5 vm, or to compile to C or some other backend. Self-hosting yields benefits:

        * Well-defined semantics (or it won't work)
        * Less chance of misfeatures like "my $x if 0"
        * Eliminates the low-level black box (such as XS or parrot assembly)
        * Proves that Perl 6 can handle complex tasks like compiler writing--Perl 5 was weak in this area


Status of the Perl 6 Compiler

Patrick Michaud followed up Autrijus's report on pugs with the status of the "official" compiler. Perl 6 embraces more than the Perl 6 language itself. Interoperability with other languages is also a high priority, which involves a runtime environment compatible with multiple languages (Parrot), a suitable suite of development tools, and a platform for code development. As Larry said, we both should and should not implement Perl 6 in Perl 6.

In short, Perl 6 isn't there yet. The specs are firming up, but still incomplete. Deficiencies in parrot are impacting the compiler now.


Parrot in Detail

What Parrot Is

Parrot is a multi-language VM intended as the interpreter for Perl 6. It started as an April Fools' joke that got out of hand. VMs get you several things: platform independence; impedance matching; a high-level base platform; a good target for a particular class of languages. The major components are the parser, compiler, optimizer, and interpreter.

The Parser converts source code into an abstract syntax tree. The Perl 6 parser can be overridden: new tokens can be added; existing tokens can have their meaning changed; languages can be swapped out. All of this will be accomplished using Perl 6 grammars. Perl 6 grammars are immensely powerful. They're a combination of perl 5 regexes, lex and yacc.

The Compiler turns AST into bytecode. Like the parser, it's overridable in perl 6. Conceptually, it's just a fancy regex. It doesn't perform any optimizations at all. These tools help in building compilers, which is a standard task but involves a lot of work. There isn't much in the way of complexity here. Even lisp programmers won't pretend to have invented it.

The VM is a "bytecode engine" based on 32-bit words. Programs are (or at least can be) precompiled and then loaded from disk. Usually no transformation at all is needed at runtime. Most opcodes can be overridden, and the bytecode loader can be overridden. Dynamic opcodes can be used for cool things like loading uncommon functions only on demand.

The VM is register based, having four sets of 32 registers: Integer, String, Float and PMC. The registers are a fast set of temporary locations--all opcodes operate on registers. There are also six stacks: one for each set of registers; one generic stack and one call stack. Stacks are segmented, but have no size limit.

The strings data type is a high-level data type, consisting of a buffer, a length field, and a set of flags. Flags cover things like character set, encoding and length in characters (rather than bytes).

The PMC type wraps the equivalent of Perl 5's variables. Tying, overloading and magic are all swept under that rug, as are arrays, hashes and scalars. All operations on PMCs use the multimethod dispatch (MMD) core. This is a recent change, and simplifies vtables and other things. It improves performance, because most interesting PMCs end up using MMD anyway. All PMCs can potentially be treated as aggregate structures: vtable entries have a _keyed variant, and the vtable must decide whether the key is valid.

Objects divide into two types: reference objects and value objects. So objects may or may not be accessed by reference. This makes it possible to handle objects in the same way as low-level types.

At this point the two speakers began flying through the slides, so it became impossible to keep up for a little while.


Why Parrot?

The biggest deficiency of the JVM and the CLR is that they're designed around static code. Classes can't be changed at runtime in the JVM; existing things can't be fiddled with in the CLR; etc. The parrot runtime will support that sort of perlish flexibility. In addition, a VM permits interoperability between different scripting languages, which takes TMTOWTDI to a new level. Today there are lots of pairwise inter-compatibility shims, which grow like n², where a common runtime needs only n adapters.

On the other hand, why not parrot? Why hasn't it been adopted yet? Why is the community so small? The first question is: what constitutes a user of parrot? People running parrot applications are not "parrot users", any more than someone running an applet in a browser is a "java user". The developers are the users. Current major users are Pugs, Tcl (ParTcl) and Perl 5 (Ponie). Folks that gave up on parrot often did so because their stress level was too high: they got tired of the moving target, among other things.

Parrot is a platform, which means that it lives or dies on the apps that others write for it. The project's biggest need is developers. The most important of those contributions are pugs and PGE. Another critical need is to keep the specs and design documents up to date. Another is tests: the best way to record a missing feature is to create a test that checks for it.


Parrot Grammar Engine

Patrick Michaud opened with the remark that Perl regexes have been so successful that of course they're being thrown away. For the purposes of this talk, "rules" are perl 6 regexes, and "regexes" are perl 5 regexes.

A brief overview of the rule syntax comes from "Apocalypse 5", though that document is really out of date now.

        * Capturing is still done with parens
        * Star, + and ? work the same
        * Pipes represent alternatives

On the other hand, modifiers go at the start of the expression rather than the end. The /e modifier has gone away: if you want a code block, stick one in there. The /x option is now the default. There are no more /s and /m modifiers; there are new metacharacters. The :w or :words modifier treats whitespace in the pattern as \s* or \s+. There are also new modifiers like :exhaustive, :keepall, :overlap and :perl5.

Among metacharacters, the dot now matches newlines. The ^ and $ always anchor to the start and end of the string. Double ^^ and $$ match beginnings and ends of lines. The # always introduces a comment. Whitespace is metasyntactic depending on the :w setting. Now \n matches newline in a platform independent way, and \N matches anything but (like the perl 5 dot). Now & is a conjunctive match.
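Pulling a few of those together (my own sketch in Apocalypse-5-era syntax):

```perl
# modifiers go up front now
m:i/ ^ abc $ /;          # case-insensitive, anchored to the whole string

# :w makes literal whitespace in the pattern mean \s* or \s+
m:w/ select <ident> from <ident> /;

# \N is "anything but newline" -- the role the perl 5 dot used to play
m/ \N* \n /;
```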

Asked for an example of & in use, he replied that he couldn't give a very clear one. He did remark that ampersands can be chained as long as you like, and every item in the chain must match the same number of characters, or the match is a failure.

Although parens capture, numbering is different.

Now square brackets are non-capturing groups, instead of the ugly (?:...) glyph. Enumerated character classes are <[...]> instead of [].

Curly braces are now no longer a repetition quantifier--it's an embedded closure that is called at that point in the match.

The repetition specifier is now **{...} for maximal matching and **{...}? for minimal matching.

Scalar variables in a rule expression are matched literally, not as sub-patterns. I.e., /$var/ matches the content of $var exactly, not $var interpreted as a pattern.

Array references match as the "or" of the elements in the list.

A hash in a pattern causes a match on the longest matching key. The value can be a closure, a sub-rule, the value 1, or else indicates failure. That's useful for building lexers that match according to the "longest matching token" rule.
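That suggests a lexer sketch like this (hypothetical; syntax per the talk):

```perl
my %keyword = (
    'if'    => 1,
    'ifdef' => 1,
    'iff'   => 1,
);

# Against the string "ifdef FOO", this matches 'ifdef', not 'if':
# the longest matching key wins, which is exactly the lexer rule you want.
m/ %keyword /;
```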

Angle brackets are heavily overloaded. For starters, a pair of angle brackets whose first content character is alphabetic is interpreted as the name of a capturing subrule. Preceding that character with ? makes it a non-capturing version. A leading ! indicates a negative match. Enumerated character classes are enclosed by <[...]>. Unescaped hyphens are forbidden, because hyphens are not used to indicate a range--you should use .. instead. A negated character class is given by <-[..]>. Plus/minus characters are used to combine subrules and character classes.

To interpolate a string into a rule, use $var. This eliminates escaping problems, among other things.

Double quotes are used to interpolate into a literal match, whatever the heck that means exactly.

Colons are used to control back-tracking behavior. A single colon forbids the rule engine to backtrack past the colon. Backtracking over a double colon causes the current set of alternations to fail. Backtracking over a triple colon causes the entire rule (or subrule) to fail. Backtracking over <commit> causes the entire match to fail (even if it's located within a rule). The <cut> assertion is like <commit>, but also deletes the matched portion of the string.
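Roughly, in increasing order of severity (my reading of the talk, not verified syntax):

```perl
m/ \d+ : '.' \d+ /;          # ':'   once the digits are eaten, don't give them back
m/ [ foo | bar ] :: baz /;   # '::'  backtracking past here fails this alternation
m/ <keyword> ::: <expr> /;   # ':::' backtracking past here fails the whole rule
m/ BEGIN <commit> <block> /; # backtracking past <commit> fails the entire match
```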


Capturing and Subrules

None of this stuff is in synopsis 5, which wasn't clear about how capturing works. It's something like this:

Every invocation of a rule returns a "match" object, which goes into a lexically scoped $/ (instead of perl 5's $0). The boolean value of $/ is true if the match was successful, otherwise false. The string value of $/ is the substring of the pattern matched. The array value of $/ contains the matches for subpatterns in parens. These can also be accessed as $0, $1, etc.

When performing a match, any capturing subpattern generates its own match object, which is stuck in the appropriate place in the parent rule's match object. One consequence is that we can no longer count parens to determine the location of the match object, because the match objects are arranged in a tree. Non-capturing patterns don't impose a capturing scope, so they don't force branching in a tree. But what about quantifiers?

When a pattern is quantified with + or * (but not ?), it produces an array of match objects--one for each match that took place. This is true regardless of the number of matches that occurred--the list can be empty or a singleton. This makes for all sorts of interesting weirdness when you factor non-capturing patterns into the mix:

        / [ \w+ (\s+) ]* / yields:
                $/[0]           An array of match objects
                $/[0][0]        Whitespace of the first iteration
                $/[0][1]        Whitespace of the second iteration, etc.

Subrules look just like subpatterns, but they capture to a match object's hash with the name of the subrule as the key. So in the pattern /<digit>/, the match is stuck into $/<digit> or $/{digit}. So I gather that wedges can be used for subscripting a hash now. Interesting. (In response to my question, Pat stated that it's "basically a qw".) If the same rule is used more than once, the matches are gathered into an array keyed by the rule name. Sets of named rules can be combined easily to build parsers. Like that ain't bleedin' obvious, since $/ is a bleedin' parse tree!
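So a tiny grammar-flavored example might look like this (hypothetical rule names, syntax per the talk):

```perl
rule month { \d**{1..2} }
rule day   { \d**{1..2} }
rule date  { (\d**{4}) '-' <month> '-' <day> }

# after matching "2005-06-29" against /<date>/ :
#   $/<date>          the match object produced by the date rule
#   $/<date>[0]       '2005', the parenthesized year
#   $/<date><month>   the match object filed under the key 'month'
```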

Each call to a rule generates a coroutine. A coroutine is a subroutine that returns control to the caller, but can be restarted where it left off with its state intact.

Currently the PGE expression parser is a recursive descent parser with thirteen node types (at the moment). The most complicated part of the parser is the handling of groups and subrules.

Using PGE

PGE builds as part of parrot. To run it, you want to load the PGE module and use p6rule. He went through that screen quickly, so I'll need to check the details online.

A grammar is any class derived from PGE::Rule that contains rule methods. Thus one can create a new grammar in a straightforward way. Subrules, on the other hand, aren't necessarily rules. Any subroutine that returns a Match PMC with some appropriate attributes can be used as a subrule. This can be used as a path to creating parsers that don't operate by recursive descent, or at least don't use it for particular subsets of the language.
