FilterProxy::Rewrite Config

How do Matchers work?

A "Matcher" (currently either tag, attrib, or regex) is applied to the file to find the content desired. So a Matcher like

tag <img src>
will match all 'img' tags which have a src attribute (regardless of the value of that attribute). A matcher like
regex /blue chickens/
will match all occurrences of the string "blue chickens". You can then apply "add" to expand the match beyond the initial tag or regex. For example,
tag </(a|img)/ /(src|href)/> add tagblock <script>
will match any 'a' tag with an 'href' attribute, or any 'img' tag with a 'src' attribute (and also <a src> and <img href>, but these are nonsensical). The add will then expand the match to include a <script> block that follows the initial <a href> or <img src>. You can use the encloser matcher instead of tagblock after add to cause the match to grow to a <script> block that encloses the initial <a href> or <img src>.
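To make the tag matcher example above concrete, here is a rough Python sketch of what a rule like tag </(a|img)/ /(src|href)/> selects. This is not how Rewrite itself is implemented. The function name find_tags and both helper patterns are made up for illustration, and the sketch assumes that a regex tag name or attribute name must match the whole name.

```python
import re

# Naive scanner for opening tags; assumes no '>' inside attribute values.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b([^>]*)>")
# One attribute: a name, optionally followed by =value (quoted or bare).
ATTRIB = re.compile(r"""([A-Za-z_:][-\w:.]*)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?""")

def find_tags(html, tagname_re, attrib_re):
    """Return opening tags whose name matches tagname_re and which carry
    at least one attribute whose name matches attrib_re (value ignored)."""
    found = []
    for tag in OPEN_TAG.finditer(html):
        # Assumption: the regex has to match the entire tag name.
        if not re.fullmatch(tagname_re, tag.group(1), re.IGNORECASE):
            continue
        names = [a.group(1) for a in ATTRIB.finditer(tag.group(2))]
        if any(re.fullmatch(attrib_re, n, re.IGNORECASE) for n in names):
            found.append(tag.group(0))
    return found

page = '<p><a href="x.html">x</a><img src="ad.gif"><img alt="no src"><a name="top"></a></p>'
hits = find_tags(page, r"a|img", r"src|href")
```

Here hits holds the <a href> tag and the <img src> tag. The <img alt> and <a name> tags are skipped because neither has a src or href attribute.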

add alternate adds "alternate content". In other words, if you match a <script> block, its alternate content is a <noscript> block. A <noscript> block is usually used to show banner ads to browsers which don't support javascript, or don't have it turned on. Often it's easy to match the ad inside <noscript> but almost impossible to match a javascript ad. Since these are often right next to each other in a page, alternate will consider them one block. alternate also knows about <layer> and <ilayer> with their alternate content <nolayer>, and <frame> with its alternate content <noframe>.
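The idea behind alternate can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Rewrite's own code: the name add_alternate is made up, and the sketch assumes the alternate block directly follows the match with nothing but whitespace in between.

```python
import re

# Tag -> tag holding its alternate content, as listed in the text above.
ALTERNATES = {"script": "noscript", "layer": "nolayer",
              "ilayer": "nolayer", "frame": "noframe"}

def add_alternate(html, start, end, tagname):
    """Grow the match html[start:end] to cover the alternate block that
    immediately follows it, if there is one."""
    alt = ALTERNATES.get(tagname.lower())
    if alt is None:
        return start, end
    block = re.compile(r"\s*<%s\b[^>]*>.*?</%s\s*>" % (alt, alt),
                       re.IGNORECASE | re.DOTALL)
    m = block.match(html, end)
    return (start, m.end()) if m else (start, end)

page = '<script src="ad.js"></script>\n<noscript><img src="ad.gif"></noscript><p>text</p>'
script_end = page.index("</script>") + len("</script>")
start, end = add_alternate(page, 0, script_end, "script")
remaining = page[:start] + page[end:]
```

After the match grows, stripping it leaves only the <p> block in remaining, so the <noscript> ad goes away together with the script.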

add balanced adds "balanced enclosers". In other words, if you match <img src=...> and it has a <center> preceding it and a </center> trailing it, balanced will consider the center tags part of the match. It continues adding balanced enclosers until it reaches a leading tag that does not have a corresponding closing tag trailing the match. balanced ignores whitespace, comments, and a few other tags like <br> and <p>.
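A simplified Python sketch of the balanced loop described above may help. It is an assumption-laden illustration, not the module's code: add_balanced is a made-up name, and the sketch skips only whitespace (the real balanced also ignores comments and tags like <br> and <p>).

```python
import re

# An opening tag that ends right before the match (whitespace allowed).
OPEN_BEFORE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>\s*$")
# A closing tag that starts right after the match (whitespace allowed).
CLOSE_AFTER = re.compile(r"\s*</([A-Za-z][A-Za-z0-9]*)\s*>")

def add_balanced(html, start, end):
    """Grow html[start:end] outward while it is wrapped by <x> ... </x>."""
    while True:
        before = OPEN_BEFORE.search(html[:start])
        after = CLOSE_AFTER.match(html, end)
        if not before or not after:
            break
        if before.group(1).lower() != after.group(1).lower():
            break  # leading tag has no corresponding closing tag: stop
        start, end = before.start(), after.end()
    return start, end

page = '<td><center><b> <img src="ad.gif"> </b></center>text</td>'
img_start = page.index("<img")
img_end = page.index(">", img_start) + 1
start, end = add_balanced(page, img_start, img_end)
grown = page[start:end]
```

The match grows past <b> and <center>, then stops at <td>, because the text after </center> is not a closing </td> tag.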

Clever combinations of add, balanced, alternate, and encloser can make most pages look like they never had an ad.

Once your match is found, it is either stripped or rewritten. Strip should be obvious: it removes the match from the page. Rewrite requires the matcher to be followed by as [block]. The match will be replaced by the text following the as keyword. No interpretation is done of the as part; it is inserted verbatim.
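The difference between the two actions, and what "verbatim" means for the as part, can be shown with a small Python sketch. The function names strip_match and rewrite_match are hypothetical, and a plain regex stands in for the matcher.

```python
import re

def strip_match(html, pattern):
    """Remove every match of pattern from the page."""
    return re.sub(pattern, "", html)

def rewrite_match(html, pattern, as_text):
    """Replace every match with as_text, taken literally.
    Passing a function keeps re.sub from interpreting backslashes
    or group references inside as_text."""
    return re.sub(pattern, lambda m: as_text, html)

page = "I like blue chickens."
stripped = strip_match(page, r" blue chickens")
rewritten = rewrite_match(page, r"blue chickens", r"red \1 hens")
```

In rewritten, the \1 in the replacement text shows up unchanged in the output. It is not treated as a reference to a captured group.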

How do names and 'ignore' work?

Each rule can be named, so that if a rule BADRULE destroys the layout of one page, you can create a site regex for it (on the FilterProxy main page) which will contain the rule ignore BADRULE. This will cause BADRULE to not be applied to sites matching that site regex. You don't have to name your rules if you don't want to. You can even name ignore rules, so that you can ignore your ignore rules. But that is probably a little silly. Rules and ignores are processed in alphabetical order, so if you want one rule or ignore to be processed before another, you can precede the name with a number (e.g. 1_MYRULE), or just name it something that comes before it alphabetically.
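As a rough illustration of the naming scheme, the Python sketch below orders some rule names and drops an ignored one. It rests on two assumptions that the text above does not confirm: that "alphabetical" means plain character-code ordering (so digits sort before capital letters), and that an ignore simply removes the named rule from the list. The function active_rules is a made-up name.

```python
def active_rules(rule_names, ignored):
    """Return the rule names that still apply, in processing order."""
    # sorted() compares strings by character code, so "1_MYRULE" < "ADS".
    return [name for name in sorted(rule_names) if name not in ignored]

names = ["WEBBUGS", "ADS", "1_MYRULE", "BADRULE"]
order = active_rules(names, ignored={"BADRULE"})
```

With these inputs, 1_MYRULE is processed first, then ADS, then WEBBUGS, and BADRULE is skipped.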

Terse syntax description:

  The basic syntax is:
    [NAME:] command matcher [[qualifying predicate] [expanding predicate] ...]

    Note: [] means "optional", {} means "mandatory", ... means "more than one"
          <> are literal, and must be included as part of the rule.
  Commands:
    strip                         remove from file
    rewrite {matcher} as {html}   change matched text to something else
    ignore {NAME} [...]           ignore a named rule (can specify more than one)

  Expanding Predicates: (modifiers that expand an existing match)
    add {matcher}
          grow match to include text matched by {matcher} (use a matcher below)
          (can apply more than once, order matters)  if {matcher} is one of
          (tag, tagblock, regex, attrib), the match will grow forward from the
          previous point until it finds {matcher}.

  Qualifying Predicates: (modifiers that also must match in order to consider it a match)
    inside {matcher}
          like encloser, except that the match that precedes it must be *inside*
          the match that follows it.  This does not change the original match
          (use add encloser instead if you want to strip the thing it's inside).

    containing {matcher}
          the matched block must contain {matcher}

  Matchers:  Each matcher "finds" a block of text that gets passed to the
      predicates that follow it.

    tag [options] <{tagname} [attrib[=value]] [...]>  
          Will grab all content enclosed by tag.  Any of tagname, attrib, or
          value can be a regular expression by enclosing them in one of these
          regex delimiters: [/#%&!,=:].

    tagblock <{tagname} [attrib[=value]] [...]>  
          Matches to the closing tag corresponding to the tag specified. (like
          old 'tag -tagblock')

    attrib <{tagname} attrib[=value] [attrib[=value]] [...]>     
          Will grab the attribute specified.  Note that you can specify more
          than one attribute, and the *first* one is the one that will be
          stripped/rewritten, but the tagname must match and other attribs are
          required to be present.

    regex /regex/
          Match any (perl) regex.  Regex must be delimited by one of:
          [/#%&!,=:].  Note that this does matching (m//), not s///, 
          tr/// or y///. (yet)

    encloser <{tagname} [attrib[=value]] [...]> 
          Like tagblock, except that the block must enclose the previous match.
          (only makes sense as argument to 'add', and should really be named 
          "enclosing tag block" but that's too long)

    balanced                                      
          Grow match to include "balanced" tags that have the tag preceding
          the match, and the corresponding closing tag trailing the match (with
          nothing in between).  Only makes sense as argument to 'add'.

    alternate                                     
          Grow the match to include "alternate content".  i.e. script/noscript,
          frame/noframe, layer/nolayer etc.  Only makes sense as argument to 
          'add'.

  In all cases more than one attrib can be specified.  You may chain as many
  matchers and predicates as you like, but if it starts to get too long it will
  probably be ambiguous and not do what you might expect.  (I need a BNF form
  grammar for this syntax...)
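
As a small step toward a formal grammar, here is a Python sketch that splits a rule line into the three top-level parts of the basic syntax: optional name, command, and the rest. It is only an illustration. The pattern RULE and the function split_rule are made up, the sketch assumes that names consist of letters, digits and underscores, and it does not parse matchers or predicates at all.

```python
import re

# [NAME:] command rest-of-rule
RULE = re.compile(
    r"^\s*(?:(?P<name>\w+):\s*)?"           # optional rule name and colon
    r"(?P<command>strip|rewrite|ignore)\s+"  # one of the three commands
    r"(?P<rest>.+?)\s*$"                     # matcher, predicates, or names
)

def split_rule(line):
    """Return (name, command, rest); name is None for unnamed rules."""
    m = RULE.match(line)
    if m is None:
        raise ValueError("not a rule: %r" % line)
    return m.group("name"), m.group("command"), m.group("rest")

named = split_rule("1_MYRULE: strip tag <img src> add balanced")
unnamed = split_rule("ignore BADRULE")
```

The two example rule lines are built from the syntax description above and are not taken from a shipped configuration.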
  

Rewrite was written by Bob McElrath. Please see the README, BUGS, and any relevant module documentation before mailing me with problems.