Thursday, June 30, 2005

YAPC 2005 Day 3

As before, these are my raw notes on the talks I attended at YAPC, day three. Perhaps tomorrow I'll summarize YAPC in a readable, pithy little journal entry.

Where did My Memory Go?

A useful tool for profiling memory use is Devel::Size, which offers both size() and total_size() functions. The first does not chase references to include their size in the total; the second does. If you hand either one a reference, it knows to start "one level down" in reporting. It also knows enough not to follow circular references forever. Presenter Dan Sugalski proceeded to demonstrate the toolkit by finding out how large some scalars, arrays and hashes are. If you want to know, this is left as an exercise for the reader.
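
For reference, a minimal sketch of both functions in action (the variable names are mine, not the presenter's):

        use Devel::Size qw(size total_size);

        my %h = (list => [1 .. 1000]);
        print size(\%h), "\n";          # the hash itself; the arrayref isn't followed
        print total_size(\%h), "\n";    # the hash plus everything it references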

One note: "foreach (<FILE>)" uses more memory than "while (<FILE>)", because foreach evaluates the filehandle in list context, slurping the entire file into a temporary list before the loop starts, whereas while reads one line at a time.
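
To illustrate (my example, with process() standing in for whatever work you do per line):

        # List context: the entire file is read into a temporary list first.
        foreach my $line (<FILE>) { process($line) }

        # Scalar context: one line in memory at a time.
        while (defined(my $line = <FILE>)) { process($line) }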

Concerning garbage collection, recall that file-scoped lexicals are basically never cleaned up. The GC clears variables when they pass out of scope. If you have something very large, you should undef it using the undef function, not by assigning undef to the variable. If you assign an empty string to a scalar, for example, the storage used by that string is not released, in case it's needed again. On the other hand, if your variable is going to end up big again, you might want to keep it around without deallocating it--profile, don't speculate. Generally, it's a pain to undef things, but you really should do it when your structure is hogging lots of memory.
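
A quick sketch of the difference:

        my $buf = 'x' x 50_000_000;     # a big string (~50MB)
        # ... use $buf ...
        undef $buf;                     # releases the storage
        # $buf = '';                    # would keep the buffer allocated for reuse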

Watch out for circular references, since things are only garbage collected when the reference count drops to zero. Either break the reference chain manually, or use weak references. Use Scalar::Util's weaken() function to make weak references. Do note that perl will immediately garbage collect a structure if every reference to it is weak.
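
A minimal sketch of breaking a cycle:

        use Scalar::Util qw(weaken);

        my $parent = {};
        my $child  = { parent => $parent };
        $parent->{child} = $child;      # circular: refcounts never reach zero
        weaken($child->{parent});       # weak link; the pair can now be collected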

Also note that every version of Perl leaks memory. At the least, use the latest version. Before 5.8, closures leaked. Parameters passed to new threads used to leak. Modifying @_ resulted in leaks before 5.6. Before 5.8.6, lots of ithread shared variables leaked.

The following will determine how much memory is being used by Perl in total.

        use Devel::Size qw(total_size);

        $Devel::Size::warn = 0;     # silence warnings about things it can't size
        $size = total_size(*::);    # *:: is the main symbol table

You can also use Devel::LeakTrace to find some cases of unfreed variables. It tends to be whiny about globals, which makes it tricky to find the real leaks. It also slows runtime down considerably, so you should only use it in debugging.


Lazy Test Development

Joe McMahon promises to tell us about not only lazy test development, but also the necessary evils incurred in doing that. A common situation is stepping through the debugger to locate a problem, which is also a good time to think about creating a test. Joe created a module, Devel::TestEmbed, that pulls Test::More into the debugger, so you can create tests while debugging. An additional method allows you to save the tests when you're happy.

Building the module involved fiddling with the debugger, which isn't easy to change. It's about nine KLOC. Patching the debugger implies ongoing maintenance. Fortunately, the debugger offers an external interface that lets us write extensions without touching the debugger itself. There are some resources:

        * A .perldb file, pulled in with a "do" by the debugger.
        * afterinit() is called right before the debugger prompt is first printed.
        * watchfunction() lives right inside the debugger's command loop, and is called before each prompt is printed.
        * @DB::typeahead allows you to stuff commands into the command buffer.
        * @DB::hist lets you look at prior commands.
        * The debugger's eval behavior can also be exploited: unrecognized commands are evaled.

Putting these together, a .perldb module is written that defines watchfunction() and afterinit(), and sets a magical $DB::trace to enable the watchfunction. The afterinit() stacks the "use Test::More qw(no_plan)" into the command buffer, so you don't have to type it. This was necessary to get the Test::More methods into the current namespace--if it were in the .perldb, it would import test methods into its namespace. The watchfunction dynamically imports tdump() into the current namespace in the program being debugged (so it follows you no matter where you are in the program). That's all watchfunction can do, because it runs outside the debugger's command loop.
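
I haven't read Devel::TestEmbed's source, but from the description, the afterinit() half of that .perldb might look something like this sketch:

        # ~/.perldb (sketch only; the real module also defines watchfunction())
        sub afterinit {
            # Stuff the import into the command buffer, so Test::More is
            # loaded in the debugged program's namespace, not .perldb's.
            push @DB::typeahead, "use Test::More qw(no_plan)";
        }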


Portable Perl - how to write code that will work everywhere

Ivor Williams started by mentioning some common misconceptions: although Perl is portable, a perl app may not be. Even if your app doesn't use XS, it may not be portable. At minimum, a motivation for writing portable Perl is that CPAN modules should be written portably. Exceptions usually have their own namespace, such as Win32::.

        * Be lazy--use existing portable modules whenever possible
        * Modularize--make plugins that wrap OS-specific stuff you can't avoid
        * Follow the rules in perldoc perlport (on which this talk is based)

Filesystem Issues

The obvious portability issue in Perl is filenames. Luckily, you can mostly ignore this, because POSIX-style paths work on Windows and VMS. The only gap is that there's no provision for a "volume" or "device" specifier in a path. There are also variations in allowed character sets, and the issue of case sensitivity.

The alternative to POSIX is to use native syntax. $^O will tell you the OS name, so you can do what you must. Better, the File::Spec module handles this for you. Its interface is OO, but you can use File::Spec::Functions to import plain functions into your namespace.
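
A quick sketch of the functional interface:

        use File::Spec::Functions qw(catfile);

        my $path = catfile('dir', 'subdir', 'file.txt');    # correct separators per OS
        print "Running on $^O; path is $path\n";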

Other issues include file permissions, which vary per OS, and symbolic/hard links, which are supported differently or not at all on some platforms.

Specifically on VMS, files are versioned, and unlink removes only one version at a time. That's why some modules say "1 while unlink 'foo'".

Specifically on the Mac, files have resource forks. Ivor is too chicken to talk about them any further.


Invoking the Shell

Just don't do it. Commands vary between platforms, so invoking shell commands won't work portably. Shell globs will also be handled differently per OS. Environment variables can't be relied on either, such as HOME, TERM, SHELL, USER, etc. Even PATH isn't always set.


User Interaction

A script might be started with file descriptors redirected. If you need to interact with the user, you can't count on STDIN, STDOUT and STDERR. Reading from "/dev/tty" is not at all portable. There's a better way, specified in perlfaq8. You can use Term::ReadKey for this purpose, though it doesn't successfully disable echo on Windows. A combination of Term::ReadKey and Term::ReadLine can be used to do the trick. Note that Term::ReadLine is a wrapper around either Term::ReadLine::Gnu or Term::ReadLine::Perl. The latter is included in the CPAN bundle (Bundle::CPAN) that the cpan command installs for you.
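
A sketch of the usual no-echo prompt with Term::ReadKey (error handling omitted):

        use Term::ReadKey;

        print "Password: ";
        ReadMode('noecho');           # stop echoing keystrokes
        my $password = ReadLine(0);   # blocking read of one line
        ReadMode('restore');          # put the terminal back the way it was
        print "\n";
        chomp $password;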


Communications

Sharing files between machines with possibly different architectures, or communicating over the network, present portability challenges. Complying with some standard is helpful. Line-ending conventions are one example of this. Perl translates "\n" to be correct for the platform on which the script is running, which may not be the convention of the other endpoint.

Sadly, for portability you should use binmode on anything that isn't known to be an ASCII text file. It matters on some platforms (though not on UNIX). Line-ending conventions and the applicable character set will also affect character counts.

Endianness is an issue as well. pack and unpack have "network standard" formats, specified with 'n' (16-bit) and 'N' (32-bit), which should be used.
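
For example:

        my $wire = pack('N', 1_000_000);    # 32-bit unsigned, network byte order
        my ($n)  = unpack('N', $wire);      # decodes identically on any architecture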


Multitasking and Job Control

Beware of forks and threads, non-blocking I/O, etc. A portable multitasking package like POE should be used instead.


Perl Blue Magic - Creating Perl Modules and Using the CPAN

Famed comedian José Castro returns to the limelight for this talk. CPAN has over 5,000 modules, and over 2,000 active developers. There are ~200 developers with more than 5 modules.

First tip: PICK GOOD NAMES FOR YOUR MODULES! Nobody will use it if they can't find it! Here José gave a few humorous examples of useless and/or strange module names.

A module has lots of junk inside, but you don't have to make it yourself. You can use h2xs and other modules for this purpose. Just do something like this:

        h2xs -XAn My::New::Module

It creates most of what you need, excluding a LICENSE or TODO file. There are other issues with h2xs modules. That's why José recommends Module::Starter instead:

        module-starter --module=My::New::Module --author="Me, Myself" --email="me@me.com"

You can also use ExtUtils::ModuleMaker. It prompts you through the creation process. You can also get help on module-authors@perl.org and modules@perl.org. But whatever you use, you should:

        * Have an idea - make sure it isn't already done
        * Document it - make sure you know how you plan to do it
        * Write tests
        * Code
        * Test
        * Ship
        * Maintain

Documentation should contain the following stuff: name; synopsis; the other things generally provided in the template by the above-mentioned utilities. Make darn sure you include acknowledgements! Note that if the version number contains an underscore, CPAN marks it as a developer release.

If you ask for a PAUSE ID, and are rejected, resubmit your application. The guy that handles the apps sometimes forgets.

Modules are "registered" when someone associated with CPAN decides they're "good". Given a choice, pick the registered modules.

Don't submit more modules than you can maintain.


Lightning Talks

This is a series of five-minute talks. The pace is supposed to be fast, so my notes will be pretty skeletal.

Five Development Tools I Can't Live Without
Casey West

        * SQL::Translator can convert schemas from one DB format to another. It can also create a diagram
        * HTTP::Server::Simple::Static provides a tiny web server without Apache
        * Devel::Cover for coverage analysis
        * podwebserver serves a web index of the POD you have installed
        * Perl::Tidy to neaten perl code
        * Test::Pod::Coverage
        * DBI::Shell
        * Module::Refresh refreshes changed modules in a running script
        * CPAN::Mini to provide a local copy of the latest CPAN modules


Refactoring Web Applications

        * What? Refactoring
        * When? Before adding a feature
        * Why? To simplify feature additions
        * How? WWW::Mechanize and AT


DBD::SQLite Intro
Zachary Zebrowski

SQLite is a small OSS DB in a single executable. It is ACID compliant and supports up to two terabytes of data. It has bindings for multiple languages and stores the entire DB in a single file. The file can be moved across platforms and still work. It's handy for rapid prototyping, local tests, etc. On the flip side, it's a single-user DB and isn't networked.


Thirty Seconds or Less
Larry something-or-other

Couldn't get co-workers to learn Wiki markup. He responded to this situation by adding even more markup to his preferred Wiki.


OpenGuides

An application for developing collaborative city guides in a wiki-like way.


Credit Cards

There are new security rules that apply to all merchants. Printed, they're an inch and a half thick. Compliance can be costly. Fines for non-compliance are also heavy.


CPAN Envy
Casey West(?)

Schwern tried to convince people to create a Javascript repository. CW decided to go ahead and create it.


Regexp::Compost
Paul Shields

A perl module that takes a text as input and produces a list of regexes that match that text (and similar ones). It exhibits an interesting heuristic for creating a regex that "fuzzily" matches a sample corpus of texts.


What Has Meng Been Up To Lately?
Meng Weng Wong

Two years ago SPF was born at YAPC in FL. Microsoft decided to "embrace and extend" it into Sender ID, which will roll out in Hotmail and Outlook. On the subject of DK, Meng tried to throw FUD in the air by pointing out that PGP and S/MIME "didn't work". He's also fooling with methods of implementing "collaborative blacklists", called "Karma". From there he went on to describe something that boils down to IM2000.


Perlcast

A podcast dedicated to Perl. Go listen if you're interested.


Annotating CPAN

The idea is to permit users to annotate packages, particularly where they think there's a gap in the documentation.


A Mail Server in Perl
Matt Sergeant

Matt maintains QPSMTPD. It reputedly handles ~1M messages per day on some hosts. The author bills it as mod_perl for email.


Wednesday, June 29, 2005

YAPC 2005 Day 2

As before, these are my raw notes from the talks attended on day 2 of the YAPC conference.

Apocalypse Now - Perl 6 is Here Today!

Part I

Autrijus Tang attempted to comfort the crowd by pointing out that the transition from Perl 4 to Perl 5 did manage to complete successfully. And by analogy, if that transition took years, we shouldn't lose hope for this one.

Reviewing the features Perl 5 brought, Autrijus mentioned modules, closures, references and CPAN. As of today, CPAN has more than 8,000 modules by more than 2,500 authors. It also offers a tool for automatic download, testing and installation. CPAN is the greatest thing ever to happen to Perl. However, Perl is not necessarily the greatest thing to happen to CPAN. The Perl 5 syntax imposes a tax on CPAN users.

For example, sigils and references don't really behave consistently:

        $s              @a              %h
        $$s             $a[0]           $h{'!'}
        $s->foo()       :-(             :-(

One explains to a newbie that the '$' sigil indicates a singleton, whereas '@' indicates a multiplicity, and '%' indicates a multiplicity tagged with names. But when dereferencing a reference, the '$' sigil doesn't indicate anything about cardinality.

As another example, there is considerable forced redundancy. The line "my $self = shift" may be as much as 10% of your code, if you write nice short subroutines like you should. The book "Higher-Order Perl" offers all sorts of fancy technology, but about 50% of it is nothing but bookkeeping.

We're used to this "language tax", because the language is useful enough to make us tolerate the cost. But in Perl 6, the tax is not necessary. In the case of sigils above, the situation resolves to:

        $s                @a                %h
        $$s                
        <<crap, he switched slides too fast>>

Another savings pertains to "perl" the runtime engine, as opposed to "perl" the language. XS modules in Perl 5 are difficult to write and to maintain. Backward compatibility prevents us from cleaning up that situation. One engineer likened this situation to a game of Jenga: we have a convoluted pile of sticks, and we enhance it by piling more on top while creating holes lower down. All the guts of Perl end up serving multiple purposes, because we're forced to keep drawing sticks from elsewhere in the pile.

Inventive CPAN authors began creating Perl 5 dialects, using source filters created with tools like Filter::Simple to build a "preprocessor" stage. These dialects are mutually incompatible. Source filters are guaranteed not to work together, because each expects a strange dialect as its input.

Another problem in Perl 5 is that the regex syntax imposes an arbitrary limit on what can be done. Similarly with operators and other aspects of the grammar. The Perl regex engine is not re-entrant, so modules built on it are guaranteed to be fragile.

OO and functional code unfairly handicap golfers. Golfers always converge on line-based text processing with y// and s// as their two main operators, because the syntax becomes too verbose elsewhere in the language. Unfortunately, that affects regular programmers as well as golfers, because we all take the path of least resistance.


Fast Forward to Perl 6

Perl 6 is faster: soft, incrementally typed, with program-wide optimizations. Stronger types can be retro-fitted to a program. It's also friendlier: you can call Perl5, C, Tcl or Python libraries directly. It's also easier: common usage patterns are dramatically shortened. You can even read a one-liner with comprehension! It's stronger, because it includes support for OO, Functional and Data-driven styles. It's leaner because it offers sane semantics instead of endless workarounds (that have become entrenched idioms, of course).

Autrijus then demonstrated a perl 6 script for uploading modules to CPAN. It was written to be interpreted using pugs, and has some workarounds for perl6 weaknesses by evaling some code in a perl5 context. You can run all your perl5 code under perl6 by writing a perl6 script that pulls it in and runs it. But before you laugh, remember that the early perl5 scripts still used perl4 libraries, because there wasn't a body of perl5 libraries available. The first perl6 scripts will behave similarly.

Today the pugs interpreter cheats by linking in libperl, so of course it works. As an aside, that makes it subject to perl5 garbage collection bugs as well.

Going the other way, you can embed perl6 code in perl5 scripts by invoking the "use pugs" pragma in your perl5 code. You can turn off the perl6 behavior with the "no pugs" pragma. That works today (but not very well). Since it's implemented as a source filter via Inline::Pugs, it won't play nicely with other source filters. It also invokes a pugs process and shuffles code back and forth to it, so you can't use it with things like DBI. Don't use it in production today. Folks are trying to fix it up so it works more reliably--stay tuned for the second part of the talk. You can get the pugs stuff at http://www.pugscode.org/.

In response to a question, Autrijus replied that he has no idea when the perl6 camel book will come out. There are two books on the market already--the only problem being that 50% of the material is outdated, and you don't know which 50%. The questioner seemed a bit frustrated and kept demanding some hint of when perl6 will be production quality. Autrijus turned the question around beautifully by suggesting that the questioner could do a great deal to hasten the day.

Pugs was started on Feb 1, so it's only five months old. It started as an exercise based on "Types and Programming Languages", another book worth recommending, along with "Higher-Order Perl". The exercise at the end of the book said, "Implement a small language that you know of." (There was general laughter at this point.) The first cut took only two days, because it was written in Haskell. Haskell is optimized for compiler writing. As an aside, Autrijus commented that Perl 5 closures leak memory horribly. Switching to lazy references resulted in random seg-faults. These problems are what drove him to use Haskell.

Autrijus summarized the progress to date, but it was a list of test scripts and such, so I didn't write it all down. Recently, someone created a pugs-based IRC bot called evalbot, so IRC users can evaluate perl6 commands interactively. The most popular command seems to be 'system("rm -rf /")', but "we aren't completely mad, so that doesn't work." Haskell makes it easy to identify unsafe commands, so a safe mode was created that prevents users from doing things like "print" or "system".

Pugs has over 100 committers, mostly because Autrijus hands out committer permissions to passers-by. There have been over 5,000 revisions committed. There seems to be a positive feedback cycle: when someone commits something, they seem to commit to the project. Autrijus's own contributions are falling off over time--currently, below 40% of commits. His main function is maintaining a pugs journal and praising contributors.

The "State of the Carrot" speech claimed that pugs covers 80% of Perl 6 semantics, but that's wishful thinking. To a first-order approximation, some 30% of the semantics is covered. There are 7,000 automated tests today, with another 15,000 anticipated.


Part II

If you're interested in contributing to pugs, the first rule is: to report a bug, write a test. Test.pm is really simple to use, if Autrijus does say so himself. A pretty little graph created by Test::TAP::HTMLMatrix shows passed tests (green), failed tests (red), todos (dark green), etc. One cute aspect of this tool is that, if tests == bug reports, and tests are added to the end of a test script, then red marks at the end of the graph generally represent new bugs.

One attendee asked about the problem of submitted tests that are wrong, or that don't test what they purport to test. Autrijus replied that it isn't common, but it isn't rare either--perhaps 10% of submitted tests have some kind of problem. In the course of maintaining his journal, Autrijus reviews every patch. Others perform code reviews on their own parts of the work.

Current efforts in pugs concern OO. Introspection and roles are not underway yet. Once the OO features are working, work will focus on rules. Closures don't work yet.

Some benefits of Perl 6:

        * Simplicity
        * Clarity--different constructs look different, such as string eval and block eval (the latter being renamed "try").
        * Sub names are clearer--for example, several clear names replace multi-purpose "length"
        * Brevity: the dot notation saves keystrokes over the arrow notation
        * Implicit parameters are removed (such as $a and $b in sorts)
        * Operator overloading
        * Method overloading via type-based dispatch
        * Some new operators like "yada yada yada" and "defined or" ("$arg //= 3" means "$arg = 3 unless defined $arg")
        * Operator chaining (using transitivity)
        * "Hyper" operators that apply to an entire array
        * Reduction operators that collapse a list by means of an operation
        * Junctions are useful for situations in which complex conditionals are used
        * Type globs are history
        * Smart match (~~) replaces (=~) but has extra coolness like "$a ~~ any(<1, 2, 3>)" (not sure of exact syntax)
        * Exception handling is unified under $!

Some people are distressed that dot as string concatenator is replaced with the tilde. But don't worry: interpolation is so darn good that the operator is practically obsolete. One attendee complained because he uses ".=" all the time.

Perl 6 has switch statements! They're coded using "given / when".
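
A sketch of the syntax as specced at the time (this is Perl 6, not Perl 5; launch() and halt() are hypothetical):

        given $command {
            when "start"   { launch() }
            when /^stop/   { halt() }
            default        { warn "unknown command" }
        }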

Problems with Pugs:

        * Can't emit compile-time errors
        * No support for warnings or strictness
        * Slow -- dispatch is 100x slower than perl5
        * <<missed the last one>>

There are also parsing problems due to the ambiguity of Perl and the difficulty of optimizing parrot assembly, so PIL, the Perl Intermediate Language, was invented. PIL has far fewer node types (namely 16 at the moment). The result is a sort of "diet perl 6": perl6 with no syntactic sugar left. PIL is still under development, so it isn't finalized, but it is stable. Other languages may or may not be translatable into PIL.

Note that in the "real" perl 6, the compiler may well not be self-hosting. Among other reasons, proponents of other languages may not like having a compiler written in Perl. So Perl/Parrot is not the same as C#/.Net. Pugs is entirely different. It wants to become self-hosting, because it's cool to be self-hosting, and it's also desirable to run perl 6 on the perl 5 vm, or to compile to C or some other backend. Self-hosting yields benefits:

        * Well-defined semantics (or it won't work)
        * Less chance of misfeatures like "my $x if 0"
        * Eliminates the low-level black box (such as XS or parrot assembly)
        * Proves that Perl 6 can handle complex tasks like compiler writing--Perl 5 was weak in this area


Status of the Perl 6 Compiler

Patrick Michaud followed up Autrijus's report on pugs with the status of the "official" compiler. Perl 6 embraces more than the Perl 6 language itself. Interoperability with other languages is also a high priority, which involves a runtime environment compatible with multiple languages (Parrot), a suitable suite of development tools, and a platform for code development. As Larry said, we both should and should not implement Perl 6 in Perl 6.

In short, Perl 6 isn't there yet. The specs are firming up, but still incomplete. Deficiencies in parrot are impacting the compiler now.


Parrot in Detail

What Parrot Is

Parrot is a multi-language VM intended as the interpreter for Perl 6. It started as an April fool's joke that got out of hand. VMs get you several things: platform independence; impedance matching; high-level base platform; good target for a particular class of languages. The major components are the parser, compiler, optimizer, and interpreter.

The Parser converts source code into an abstract syntax tree. The Perl 6 parser can be overridden: new tokens can be added; existing tokens can have their meaning changed; languages can be swapped out. All of this will be accomplished using Perl 6 grammars. Perl 6 grammars are immensely powerful. They're a combination of perl 5 regexes, lex and yacc.

The Compiler turns the AST into bytecode. Like the parser, it's overridable in perl 6. Conceptually, it's just a fancy regex. It doesn't perform any optimizations at all. These tools help in building compilers, which is a standard task but involves a lot of work. There isn't much in the way of complexity here. Even lisp programmers won't pretend to have invented it.

The VM is a "bytecode engine" based on 32-bit words. Programs are (or at least can be) precompiled and then loaded from disk. Usually no transformation at all is needed at runtime. Most opcodes can be overridden, and the bytecode loader can be overridden. Dynamic opcodes can be used for cool things like loading uncommon functions only on demand.

The VM is register-based, having four sets of 32 registers: Integer, String, Float and PMC. The registers are a fast set of temporary locations--all opcodes operate on registers. There are also six stacks: one for each set of registers, one generic stack, and one call stack. Stacks are segmented, but have no size limit.

The string data type is a high-level data type, consisting of a buffer, a length field, and a set of flags. Flags cover things like character set, encoding and length in characters (rather than bytes).

The PMC type wraps the equivalent of Perl 5's variables. Tying, overloading and magic are all swept under that rug, as are arrays, hashes and scalars. All operations on PMCs use the multi-method dispatch core (MMC). This is a recent change, and simplifies vtables and other things. It improves performance, because most interesting PMCs end up using MMC anyway. All PMCs can potentially be treated as aggregate structures: vtable entries have a _keyed variant, and the vtable must decide whether the key is valid.

Objects divide into two types: reference objects and value objects. So objects may or may not be accessed by reference. This makes it possible to handle objects in the same way as low-level types.

At this point the two speakers began flying through the slides, so it became impossible to keep up for a little while.


Why Parrot?

The biggest deficiency of the JVM and the CLR is that they're designed around static code. Classes can't be changed at runtime in the JVM; existing things can't be fiddled with in the CLR; etc. The parrot runtime will support that sort of perlish flexibility. In addition, a VM permits interoperability between different scripting languages, which takes TMTOWTDI to a new level. Today there are lots of inter-compatibility shims, which grow like n^2 (one per pair of languages); a common runtime needs only n (one per language).

On the other hand, why not parrot? Why hasn't it been adopted yet? Why is the community so small? The first question is, what constitutes a user of parrot? People running parrot applications are not "parrot users", any more than someone running an applet in a browser is a "java user". The developers are the users. Current major users are Pugs, Tcl (ParTcl) and Perl 5 (Ponie). Folks that gave up on parrot often did so because their stress level was too high. They got tired of the moving target, among other things.

Parrot is a platform, which means that it lives or dies on the apps that others write for it. The project's biggest need is developers. The most important of those contributions are pugs and PGE. Another critical need is to keep the specs and design documents up to date. Another is tests: the best way to record a missing feature is to create a test that checks for it.


Parrot Grammar Engine

Patrick Michaud opened with the remark that Perl regexes have been so successful that of course they're being thrown away. For the purposes of this talk, "rules" are perl 6 regexes, and "regexes" are perl 5 regexes.

A brief overview of the rule syntax comes from "Apocalypse 5", though that document is really out of date now.

        * Capturing is still done with parens
        * Star, + and ? work the same
        * Pipes represent alternatives

On the other hand, modifiers go at the start of the expression rather than the end. The /e modifier has gone away: if you want a code block, stick one in there. The /x option is now the default. There are no more /s and /m modifiers; there are new metacharacters. The :w or :words modifier treats whitespace in the pattern as \s* or \s+. There are also new modifiers like :exhaustive, :keepall, :overlap and :perl5.

Among metacharacters, the dot now matches newlines. The ^ and $ always anchor to the start and end of the string. Double ^^ and $$ match beginnings and ends of lines. The # always introduces a comment. Whitespace is metasyntactic depending on the :w setting. Now \n matches newline in a platform independent way, and \N matches anything but (like the perl 5 dot). Now & is a conjunctive match.

Asked for an example of & in use, he replied that he couldn't give a very clear one. He did remark that ampersand matches can be chained as long as you like, and every item in the chain must eat the same number of characters, or the match fails.

Although parens capture, numbering is different.

Square brackets are now non-capturing groups, instead of the ugly (?:...). Enumerated character classes are <[...]> instead of [...].

Curly braces are no longer a repetition quantifier--a brace block is now an embedded closure that is called at that point in the match.

The repetition specifier is now **{...} for maximal matching and **{...}? for minimal matching.

Scalar variables in a rule expression are matched literally, not as a sub-pattern. I.e., /$var/ matches the content of $var exactly, not $var interpreted as a pattern.

Array references match as the "or" of the elements in the list.

A hash in a pattern causes a match on the longest matching key. The value can be a closure, a sub-rule, the value 1, or else indicates failure. That's useful for building lexers that match according to the "longest matching token" rule.

Angle brackets are heavily overloaded. For starters, a pair of angle brackets whose first content character is alphabetic is interpreted as the name of a capturing subrule. Preceding that character with ? makes it a non-capturing version. A leading ! indicates a negative match. Enumerated character classes are enclosed by <[...]>. Unescaped hyphens are forbidden, because hyphens are not used to indicate a range--you should use .. instead. A negated character class is given by <-[..]>. Plus/minus characters are used to combine subrules and character classes.

To interpolate a string into a rule, use $var. This eliminates escaping problems, among other things.

Double quotes are used to interpolate into a literal match, whatever the heck that means exactly.

Colons are used to control back-tracking behavior. A single colon forbids the rule engine to backtrack past the colon. Backtracking over a double colon causes the current set of alternations to fail. Backtracking over a triple colon causes the entire rule (or subrule) to fail. Backtracking over <commit> causes the entire match to fail (even if it's located within a rule). The <cut> assertion is like <commit>, but also deletes the matched portion of the string.


Capturing and Subrules

None of this stuff is in synopsis 5, which wasn't clear about how capturing works. It's something like this:

Every invocation of a rule returns a "match" object, which goes into a lexically scoped $/ (instead of perl 5's $0). The boolean value of $/ is true if the match was successful, otherwise false. The string value of $/ is the substring of the pattern matched. The array value of $/ contains the matches for subpatterns in parens. These can also be accessed as $0, $1, etc.

When performing a match, any capturing subpattern generates its own match object, which is stuck in the appropriate place in the parent rule's match object. One consequence is that we can no longer count parens to determine the location of the match object, because the match objects are arranged in a tree. Non-capturing patterns don't impose a capturing scope, so they don't force branching in a tree. But what about quantifiers?

When a pattern is quantified with + or * (but not ?), it produces an array of match objects--one for each match that took place. This is true regardless of the number of matches that occurred--the list can be empty or a singleton. This makes for all sorts of interesting weirdness when you factor non-capturing patterns into the mix:

        / [ \w+ (\s+) ]* / yields:
                $/[0]           an array of match objects
                $/[0][0]        whitespace of the first iteration
                $/[0][1]        whitespace of the second iteration, etc.

Subrules look just like subpatterns, but they capture to a match object's hash with the name of the subrule as the key. So in the pattern /<digit>/, the match is stuck into $/<digit> or $/{digit}. So I gather that wedges can be used for subscripting a hash now. Interesting. (In response to my question, Pat stated that it's "basically a qw".) If the same rule is used more than once, the matches are gathered into an array keyed by the rule name. Sets of named rules can be combined easily to build parsers. Like that ain't bleedin' obvious, since $/ is a bleedin' parse tree!

Each call to a rule generates a coroutine. A coroutine is a subroutine that returns control to the caller, but can be restarted where it left off with its state intact.

Currently the PGE expression parser is a recursive descent parser with thirteen node types (at the moment). The most complicated part of the parser is the handling of groups and subrules.

Using PGE

PGE builds as part of parrot. To run it, you want to load the PGE module and use p6rule. He went through that screen quickly, so I'll need to check the details online.

A grammar is any class derived from PGE::Rule that contains rule methods. Thus one can create a new grammar in a straightforward way. Subrules, on the other hand, aren't necessarily rules. Any subroutine that returns a Match PMC with some appropriate attributes can be used as a subrule. This can be used as a path to creating parsers that don't operate by recursive descent, or at least don't use it for particular subsets of the language.

Monday, June 27, 2005

YAPC 2005 Day One

This week I'm attending the North American YAPC (Yet Another Perl Conference) in Toronto. This is my first YAPC, since I didn't make it to Buffalo last year.

It's an interesting display of organizational prowess, so far. Last night the schedule said we could register between 22:00 and 23:00, but when we arrived there were no registration personnel to be found. The conference organizers present had no keys to the room with the nametags and t-shirts. The next morning, when we returned to register, one of the staffers put out a call for a video camera to record the proceedings (or at least the keynote).

We're now into the "opening ceremonies", scheduled for 9:00 but beginning at 9:30. Glad it's not me trying to organize this thing--I'd be frazzled silly.

Keynote: Larry Wall

Larry opened his remarks with a picture of the Golden Gate Bridge, and described his talk as "bridges and other things". His focus was on the idea of building communities, particularly of course the Perl community. He suggested that the right questions for OSS authors to ask are:

        * Who will naturally be interested in my project?
        * Who are we accidentally/purposely excluding from our project?
        * Who should lead/follow/contribute?

        * What is the goal of our community?
        * What can people contribute?
        * What are the community's rules and structure?
        * What's in it for the volunteers?

        * Where will the community meet in cyber/physical space?
        * Where can sub-communities form, either by design or spontaneously?

        * When is it too soon to form a community?
        * When does the community reach a "tipping point"?
        * When is it time to form sub-communities?
        * When is the right time to fork?
        * When are we done?

        * Why do we really want a community?
        * Why do people join and leave the community?
        * Why do people fight and stop fighting?

        * How do we do it?

On the question of when to form a community, Larry remarked that in the first days of Perl, users proposed that Larry start a Perl newsgroup. He resisted this request for about a year, because he wanted Perl users to infest the shell-users' newsgroups. As a result, Perl users promoted Perl by injecting Perl-ish solutions in addition to shell-ish ones.

Concerning the question of our real motivations for starting a community, Larry remarked that the desire for a large pool of free labor isn't a very good motivation for community building.

Waxing philosophical, Larry invoked the idea of "tensegrity", or "tensional integrity". The idea is that a stable structure results from balancing the forces that "pull" and the forces that "push". He accompanied this with several pretty pictures involving rods and rubber bands. By contrast, he suggested that the "geek" community resembles a big pile of rocks (geeks)--disorganized, formless, and without opposing forces (each member autistic to one degree or another). The linux community might be represented as a few stone pillars (distributions) around a beach, with Linus at the top, and where of course the users are dirt. A militaristic model would be a single tall tower--stratified, in which pecking order is the foremost consideration.

A more dynamic community exhibiting "tensegrity" involves both pushers and pullers, and requires us to grapple with seeming contradictions. The result of such a dynamic is "larger structures that don't fall down". Among the "contradictions" that must be reconciled to build a community are:

        1. People are naturally good / bad.
        2. People love / hate new ideas.
        3. People love / hate outsiders.
        4. People should all be alike / different.
        5. People should / shouldn't be in charge of others.
        6. It should be easy / hard to break into the community.
        7. We should / shouldn't try hard to keep people in.
        8. Specialization is good / bad.
        9. People volunteer for altruistic / selfish reasons.
        10. We do / don't need a benevolent dictator.
        11. Larry Wall is important / unimportant. Or more to the point, he's a genius / idiot.
        12. Modernism is good / bad.

From there he rambled off into a discussion of "natural communities" from a darwinian perspective, invoking notions like a "large gene pool", "speciation", "range of variation", etc. He suggested that part of what's taking Perl 6 so long is the effort to ensure that it's modeled on natural communities that thrive, rather than those that go extinct. For example, the Perl 5 community is seen by Larry as neither sufficiently unified nor sufficiently diverse.

        A community needs to share a set of core values, but also to allow honest differences on the periphery.

A technological solution to this set of problems both is and isn't possible. On the positive side, Perl 6 will have a finer-grained extension mechanism. Among other things, scoping will be clarified and cleaned up. A CPAN-like repository can provide a gene pool. We can separate combatants, if we can convince them to join separate mailing lists. Technologically, we can at least provide enough mailing lists for hostile tribes to coexist peaceably.

On the negative side, people are still basically irrational. One way to mitigate this is to look for cheerleading opportunities. We can try to tolerate differences within the community, but it ain't easy. We want to encourage and discourage cultism. We want to "have fun", but we can't always. Sometimes building a community involves submitting to crucifixion.

Allison Randal: The State of the Carrot

A "carrot" is what you get when you cross a camel with a parrot. Allison read a parody based on "The Hunting of the Snark".

Over the past year Perl 5 has experienced a bunch of fixes and optimizations. More interestingly, reverse sort no longer uses an intermediate list, which improves performance. Some setuidperl exploits have been fixed. PLEASE stop using setuidperl. A new -Dusesitecustomize build option permits customization of @INC using a site customization script.

On the Perl 6 front, there are many pieces.

At the bottom is Parrot, the VM for Perl. There was the parrot 0.1.1 release last October, including incremental garbage collection and a "make install" target. Parrot moved to subversion with version 0.2.0. Current version is 0.2.1.

Next is Ponie, the Perl 5 compatibility layer. Snapshot 4 was released today. Ponie work has benefited Perl 5 as well, because improvements made in Ponie are being back-ported to Perl 5.

Pugs is a Perl 6 prototype. Some 80-90% of the Perl 6 semantics have been implemented already. It's currently written in Haskell, but ultimately it should be written in Perl 6.

Allison also reported some things about the funding of the Perl foundation, Perl Mongers and Perl Monks. There's a new Perl logo (a pearl onion) that can be used without legal encumbrance, because O'Reilly owns the camel.

Session 1: The Tester's Toolkit

Pete Krawczyk opened with the usual rah-rah in favor of automated regression testing: tests supplement documentation; they facilitate bug reports; they reduce maintenance costs; etc.

A "test" is a perl program with extra modules, that reports actual versus expected results. Tests are usually invoked via a "test" target in the makefile. Another useful command, as of Perl 5.8.3, is "prove", which runs a directory of tests. A script named t/TEST is also sometimes used. But since a test is a Perl script, you can run it by hand (but without the summary features provided by the test harness). Example code for this talk is found in the Acme::PETEK::Testkit module.

Considerations when writing tests:

1. Make sure most important code is tested. People don't actually test every branch of code, and ROI diminishes as you jump through hoops trying to achieve complete coverage. Conversely, from zero tests, every added test is an improvement.
2. Test scripts should have a "plan".
3. Don't print to STDOUT! Use diag() instead. Testing scripts that print to stdout may involve extra work to capture output.
4. Test for failure as well as success.
5. Give tests a description. If you don't, you'll be stuck figuring out which one was "test 50046".

Moving on to testing specifics, Pete introduced Test::More by showing some examples of the standard tests, use_ok(), is(), is_deeply(), cmp_ok(), can_ok(), etc. Tests can be put in a "SKIP" or a "TODO" block, and Test::More will handle them gracefully.
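
A typical script using those functions (the module and sub names here are placeholders):

        use Test::More tests => 3;

        use_ok('My::Module');
        is(My::Module::add(2, 2), 4, 'add() handles small integers');
        is_deeply([My::Module::pair(7)], [7, 7], 'pair() duplicates its argument');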

The prove script handles invocation of tests in a folder, using Test::Harness, simplifying the relevant Perl one-liner. It also has extra features such as "verbose" and "shuffle" modes. It is intended as a development tool, to run tests with some granularity during debug/test cycles.
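
For example:

        prove t/              # run every test in t/
        prove -v t/basic.t    # verbose: show each individual result
        prove -s t/           # shuffle: run the tests in random order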

He went on to talk about Test::Inline, and I didn't pay close attention to that section. Including tests within the script to be tested is a matter of taste, and my taste doesn't run that way. I also skipped the Test::Legacy section. It's intended purely to migrate tests written with Test.pm to the new testing framework. Likewise, I already know about Test::POD.

For those who shy away from complex tests, there's Test::Simple. It's a subset of Test::More functionality, so you can retarget tests written with Test::Simple to Test::More when you need the additional features. Test::Simple has only one function: ok().

Other test modules exist for things like performing web browsing or accessing a database. Most of these test modules combine nicely. Apache::Test specifically is the topic of another talk at YAPC. It appears to provide a kind of "sandbox" for testing. In that vein, it handles issues of hosts and ports, so the test writer doesn't have to worry about them. It uses Test::Legacy syntax, so it potentially presents interoperability issues with other testing modules. Test::More support is being added, but should be considered experimental today. Pete showed an example using Apache::Test, which I looked at cursorily, since I plan to attend the Apache::Test talks later today.

Test::WWW::Mechanize can be used to perform traversal of sites. It handles cookies and form values, etc. It can be used with Apache::Test, where Apache::TestRequest::module2url() is used to convert relative URLs to something usable against the "sandbox" Apache instance.

Test::DatabaseRow can be used to perform simple tests against the database. You assign it a database handle to run against, and it can generate some SQL for you.

Test::Expect can be used to test console apps, including tests of remote applications using ssh or telnet. It uses a syntax reminiscent of Expect, as you might expect.

Test::Differences puts test diffs in a table for viewing. This can be useful for determining which parts of a test suite did not behave as expected.

Test::SQL::Translator can be used to verify the correctness of a DB schema.

To determine how much of your code is covered by your tests, you can use Devel::Cover from CPAN. It runs transparently with your tests, and compiles statistics on your code coverage. It can generate HTML output for viewing in a web browser, with color codes.
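
The usual recipe, straight from the Devel::Cover documentation:

        cover -delete                                     # clear old coverage data
        HARNESS_PERL_SWITCHES=-MDevel::Cover make test    # run the suite under Devel::Cover
        cover                                             # write the report, HTML included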

General tips:

        1. Write a test for each bug you fix.
        2. Automate your automated tests.
        3. Consider test-first development.
        4. Help write tests for others' modules that you use.
        5. Encourage others to test their code.


chromatic & Ian Langworth: Solutions To Common Testing Problems


General Enhancements to Test::More


People usually start with Test::More, but soon end up wanting better diagnostics than it provides. For example, people commonly use is_deeply(). The benefit of using it is that it will highlight, on failure, where in the data structure a disagreement is found. You can use diff to get all differences, but is_deeply() gives only the first point of disagreement. Test::Differences provides a similar functionality but shows all differences. It also shows per-line differences between multiline strings.

When using is() to compare strings, both strings are printed in full. This can be useless for comparison, especially if the strings are long. Test::LongString addresses this problem, via the functions is_string(), contains_string() and lacks_string().

Beyond strings, another target of testing is nested data structures. One approach to them is to focus on the composition of the structure, rather than its content. Test::Deep offers cmp_deeply() for this purpose. You can tell cmp_deeply() what the structure should look like in general terms. One argument to cmp_deeply() is a template for the data structure to be tested. The template can specify an array, a subhash or a superhash. Another supported concept is a "bag", which matches bags (i.e., unordered sets possibly containing duplicates).
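
A sketch of a cmp_deeply() template (the field names are invented):

        use Test::Deep;

        cmp_deeply($record, {
            name   => 'fred',
            tags   => bag(qw(admin user)),    # unordered, duplicates allowed
            extras => ignore(),               # must exist, contents don't matter
        }, 'record has the right shape');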


Testing with Databases


One tricky part of testing is that more than Perl code needs to be tested. One trick is to use mock objects to stand in for the DB, but that isn't always the best trick. You can use a different database instead, with test data. Or you can connect to the live system for testing.

To mock the DB connection, you can use the DBD::Mock module. One of the obvious candidates for testing with a mock DB is to test failure modes, such as login failure. In the mock object, you can set flags to simulate login failure, DB connection going away, success, etc.
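
A sketch of simulating a vanished connection:

        use DBI;

        my $dbh = DBI->connect('dbi:Mock:', '', '', { RaiseError => 1 });
        $dbh->{mock_can_connect} = 0;    # subsequent calls fail as if the DB went away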

One candidate for a substitute database is DBD::SQLite, which accepts SQL commands but has no network connection, multi-user support, etc. This can be used for inserts and selects, without affecting the target DB.

The final approach to DB testing is to use the same DB back-end as the production environment, with a test data set. In the build file, the installer can be prompted for a DB name, user, password, etc., to use in tests. To the end of putting this stuff in your build file, you can use Module::Build. It's much easier than ExtUtils::MakeMaker. Among other things, it provides facilities for prompting users for settings, along with logic to adopt the defaults when performing an automated build.


Testing Web Sites


With Test::WWW::Mechanize you can simulate a web browser to test web sites. It provides methods for submitting forms, clicking links, etc. It also provides test methods for examining titles and other page elements. Another utility, included with WWW::Mechanize, is mech-dump, which can be used to examine the structure of a page, for example to learn the name of a form if you don't already know it. Instead of mech-dump, you can use HTTP::Recorder to create a proxy and examine data as it flows back and forth. The proxy can be used to pop up an additional page displaying some information about the exchange. Note that HTTP::Recorder is new and limited: it doesn't handle SSL or Javascript.
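
The basic shape of such a test (the URL and page details are invented):

        use Test::WWW::Mechanize;

        my $mech = Test::WWW::Mechanize->new;
        $mech->get_ok('http://localhost/app/login', 'login page fetched');
        $mech->title_is('Log In', 'expected title');
        $mech->content_contains('Forgot your password?', 'help link present');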

The HTML can be validated as a whole using Test::HTML::Tidy. If you only want to check certain things in the HTML you can use Test::HTML::Lint. In response to a question from the crowd: the speakers don't know if there's a handy module for testing XSS vulnerabilities.


Testing with Mock Objects


Mock objects are handy when testing conditions that are difficult to produce for one reason or other: supplying a missing network connection; pretending to reformat the hard drive; simulating obscure failures; etc.

For example, suppose you want to test a bit of code that makes a system call that may or may not succeed on the testing machine (for example, due to lack of speakers at the time of test). This can be done by overriding "system" as follows. Note that "system" must be overridden before the module to be tested is loaded.

        package TestModule;
        use subs 'system';          # declare system() as overridable

        *TestModule::system = sub {
            return 0;               # pretend the call succeeded
        };
        # [..]

One word of advice: tests involving mocking like this should probably be run in their own files, so weird things like overridden functions don't have side effects that leak into other tests.

Mock objects can be created for testing with Test::MockObject. The author has written some articles about Test::MockObject on Perl.com.
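
A small sketch (the mocked method names are invented):

        use Test::MockObject;

        my $mock = Test::MockObject->new;
        $mock->set_true('connect');                          # connect() always returns true
        $mock->set_always('name', 'mock-db');                # name() always returns 'mock-db'
        $mock->mock('fetch', sub { return { id => 1 } });    # arbitrary behavior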


Unit Testing with Test::Class

Test scripts discussed so far in this session are procedural. Test::Class treats tests as objects. A new test is created as a subclass of Test::Class. The class implements an analogue of Ruby's fixtures: services such as the setup and tear-down surrounding execution of test cases. To use a test class based on Test::Class, simply use the module you've created and execute the class method runtests().

There are facilities for skipping tests in units of one class (possibly including all its subclasses).

Another advantage of Test::Class is that you can ship the test classes with your package. Then users that subclass your objects can also subclass your tests and leverage your effort.

Test::Class also facilitates creation of test plans by allowing you to specify test counts piecemeal, and then collecting them into a plan for you.
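
The skeleton of a Test::Class subclass (the class and method names are invented, and My::Widget is assumed to be loaded elsewhere):

        package My::Widget::Test;
        use base 'Test::Class';
        use Test::More;

        # The fixture: runs before each test method.
        sub setup : Test(setup) {
            my $self = shift;
            $self->{widget} = My::Widget->new;
        }

        # Contributes two tests to the collected plan.
        sub spins_ok : Test(2) {
            my $self = shift;
            ok($self->{widget}->spin, 'widget spins');
            is($self->{widget}->rpm, 45, 'at the right speed');
        }

        package main;
        My::Widget::Test->runtests;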


A Few Cool Things about mod_perl 2.0

The list of mod_perl directives has grown some. The list of classes has grown tremendously.

Writing a PerlTypeHandler didn't work before, because mod_mime has a stranglehold on the request. That has changed. You can now write PerlTypeHandlers easily. You probably won't, because nobody ever needs to, but you can.

What about Apache2::Const::OK? Big changes came right before mod_perl 2.0 because of the "Great Namespace War" of 2004. So now there's something you need to know to migrate to mod_perl 2. What you need to know is that all Apache:: modules now live in the Apache2:: namespace. The only exception is Apache::Test. That includes Apache constants, because they are fully qualified by their package:

        - Apache2::Const::OK
        - APR::Const::SUCCESS

But no matter what the docs say, you don't need to use -compile. Just do a:

        use Apache2::Const qw(OK);
        ...
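
Putting the new names together, a minimal mod_perl 2.0 response handler might look like this sketch (using the plain import the speaker recommends, though the docs show -compile instead):

        package My::Hello;
        use strict;
        use Apache2::RequestRec ();     # for $r->content_type
        use Apache2::RequestIO ();      # for $r->print
        use Apache2::Const qw(OK);

        sub handler {
            my $r = shift;
            $r->content_type('text/plain');
            $r->print("Hello from mod_perl 2.0\n");
            return OK;
        }
        1;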

It isn't too hard to migrate to mod_perl 2--you can practically do it with a sed script, as long as you're on mod_perl 1.99. Going from 1 to 2, see last year's talk, "Why mod_perl 2.0 Sucks". But back to what's cool about mod_perl 2.0...

Apache 2.0 has over 340 directives, but only 90 are from "core" Apache. The rest are from extension modules. Those extension directives must be wrapped in IfModule directives in the Apache config. Both versions of mod_perl provide an API for defining new Apache directives, but the API in 1.0 was too intimidating. The 2.0 directive handler is in pure Perl.

Total Access is another cool mod_perl 2.0 feature. Whereas version 1.0 was incomplete, 2.0 offers complete access to everything in the Apache API. There's even a method called assbackwards() (whatever that does).

Output Filters are a new feature in Apache 2.0. Output filters are "things" that allow you to post-process content after the content phase has run. One example of an Apache filter is the one that processes SSI tags: the output of CGI scripts is not run through that mechanism, so CGI scripts can't use SSI tags. Although mod_perl has been able to filter content for years, it was previously only able to process mod_perl output itself. Now it's possible to filter output at a later stage. It's now possible to use mod_perl to filter output of PHP scripts, for example.

Stacked Perl Handlers are an idea borrowed from Apache that mod_perl 1.0 didn't get right. In Apache, how the module list is traversed depends on the phase. In some phases, the handler list is exhausted. In other phases, the list is traversed until the first handler returns OK (authentication is one example of this). The mod_perl 1.0 version didn't allow for early termination on return of OK, but it's now been fixed. One effect of this fix is that PerlAuthenHandler was able to be re-written very nicely, without all the ridiculous workarounds.

To finish with a plug: Apache::Test totally rocks, and you should use it for everything, every time.


Perl Black Magic: Obfuscation and Golfing

José Castro introduced himself by requesting that people stop calling him "Hosé": being Portuguese, the correct pronunciation is Joe-say. From this lighthearted beginning, José proceeded to give a hilarious presentation concerning obfuscated Perl. His pace was much too fast to keep up with, so I won't try to capture his talk in detail. Just remember to check out his slides when they become available at http://www.jose-castro.org/talks/index.html. Here are a couple of teasers:

To impress your friends with obfuscation, you have to give them something that they don't understand right away, but do understand eventually. If, when you explain what a script does, they still don't understand it, they won't be impressed with your cleverness. But if they find out that your incredibly convoluted script prints, "Just another Perl hacker," they'll be impressed.

Some clever ways to make your Perl incomprehensible include:

        1. Gratuitous use of the ternary operator
        2. Adding distractors in the untaken branch of the ternary operator
        3. Using lots of semicolons you don't need
        4. Remove whitespace to enhance unreadability
        5. Never use plain s/// in substitutions: "ssfromstos" (with s as its own delimiter) is much more confusing
        6. Use lots of pound signs, until people can't tell what's a comment and what ain't

Definition: "Golfing" is the art of programming with as few characters as possible. You start, of course, by eliminating all whitespace. And you never use a variable name longer than one letter. Of course you leave off semicolons whenever it's allowed. Above all, you shouldn't forget to exploit the power of Perl's command-line options, of which "-n" is only the tip of the iceberg.

Two clever operators:

        The Eskimo operator: }{

This cute little number looks very confusing, but think what it does at the start of a script invoked with "perl -n"! Check the manpage if you can't figure it out directly.
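
For example, this golfed line count works because }{ closes the implicit while loop that -n wraps around your code, leaving a block that runs after the last line is read:

        perl -lne '}{print $.' file.txt    # prints the number of lines, like wc -l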

        The shopping-cart operator: @{[ ]}

This operator can be used to perform operations within strings, for example: the inner square brackets wrap any expression in a (possibly empty) anonymous array reference, which the outer @{ } dereferences on the spot.

Monday, June 06, 2005

Church Politics and Patience

It's an enlightening experience, taking part in church committees. I've been on several--though never as a volunteer. There always seem to be a few people engaging in political maneuvers that would seem more at home inside the beltway. You'd think Christians would behave better, but if so you'd be naïve: Christians are people too. On the other hand, Bible-bashers would use those bad apples as proof that Christianity is a farce, but that's intellectual dishonesty on par with arguing that the police force should be disbanded because some cops are crooked.

Seeing these political games first-hand has taught me a lesson or two in patience. It's my nature to be outraged when people spread misinformation, conduct behind-the-scenes campaigns, and try to force issues in their preferred direction. But why? Because it's dishonest? Because it's not Christ-like? I'd like to think that my motives are that pure--that I'm really that driven by principle. Certainly that's part of it. But the main reason I'm so outraged is that I fear these efforts might succeed: the misinformation might be believed; the bad counsel might be followed; wrong-doing may carry the day. In other words, my outrage is partly fear that wrong will triumph, and partly an impatient need to seize control immediately, and to personally make sure that wrong is put in its place.

What I've found, in being forced to handle these things patiently, has been very helpful. For one thing, there are plenty of other men of principle who will do their part, and when I wait I'm often rewarded with such people stepping up to the plate. The world doesn't rest on one person's shoulders. On the other hand, I've also seen that wrong does sometimes win, and there's nothing I can do about it. Sometimes misinformation is believed, slander is repeated, good deeds are punished and bad deeds rewarded. It's rather liberating to wake up to the reality that forcing the right outcome just can't be done by a mere mortal. Scripture says that "evil men and seducers shall wax worse and worse, deceiving, and being deceived" [2 Tim 3:13]. Christ himself said, "when the son of man cometh, shall he find faith on the earth?" [Luke 18:8]. The Bible (not to mention experience) assures us not only that evil exists, but that it often gets the upper hand in this world. It doesn't do any good to go crazy raging against that reality.

Patience becomes possible when we accept that we aren't in control, accept that things will indeed come out wrong, and trust that God is in control and will ultimately put things right. Without trust in God, acceptance of reality makes us cynical and ultimately leads to despair. On the other hand if we don't accept the reality that evil often triumphs in the short term, we will constantly fight battles we can never win. That's a waste of energy at best, and it threatens to weaken our faith in God's ultimate fairness.