FilterProxy::Rewrite Config

How do Matchers work?

A "Matcher" (currently either tag, attrib, or regex) is applied to the file to find the content desired. So a Matcher like

tag <img src>

will match all 'img' tags which have a src attribute (regardless of the value of that attribute). A matcher like

regex /blue chickens/

will match all occurrences of the string "blue chickens". You can then apply "add" to expand the match beyond the initial tag or regex. For example,

tag </(a|img)/ /(src|href)/> add tagblock <script>

will match any 'a' tag with an 'href' attribute, or any 'img' tag with a 'src' attribute (and also <a src> and <img href>, but these are nonsensical). The add will then expand the match to include a <script> block that follows the initial <a href> or <img src>. You can use the encloser matcher instead of tagblock to cause it to grow to a <script> block that encloses the initial <a href> or <img src>.

add alternate adds "alternate content". In other words, if you match a <script> block, its alternate content is a <noscript> block. This is usually used to show banner ads to browsers which don't support javascript, or don't have it turned on. Often it's easy to match the ad inside <noscript> but almost impossible to match a javascript ad. Since these are often right next to each other in a page, alternate will consider them one block. alternate also knows about <layer> and <ilayer> and their alternate content <nolayer>, and <frame> and its alternate content <noframe>.

add balanced adds "balanced enclosers". In other words, if you match <img src=...> and it has a <center> preceding it and a </center> trailing it, balanced will consider the center tags part of the match. It continues adding balanced enclosers until it reaches a leading tag that does not have a corresponding closing tag trailing the match. balanced ignores whitespace, comments, and a few other tags like <br> and <p>.
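
The growth step of add balanced can be modelled roughly as follows. This is a simplified Python sketch, not FilterProxy's actual Perl code: the function name grow_balanced and both regexes are made up for illustration, and unlike the real matcher it only skips whitespace (not comments, <br> or <p>).

```python
import re

# Hypothetical helpers: an opening tag that ends right before the match,
# and a closing tag that starts right after it (whitespace allowed).
OPEN_BEFORE = re.compile(r"<(\w+)[^<>]*>\s*$")
CLOSE_AFTER = re.compile(r"\s*</(\w+)\s*>")

def grow_balanced(html, start, end):
    """Grow the span [start, end) outward over balanced enclosing tags."""
    while True:
        before = OPEN_BEFORE.search(html[:start])
        after = CLOSE_AFTER.match(html[end:])
        if not before or not after:
            return start, end
        # Stop at the first leading tag without a matching closing tag.
        if before.group(1).lower() != after.group(1).lower():
            return start, end
        start = before.start()
        end += after.end()

page = '<td><center><img src="ad.gif"></center></td><p>text</p>'
img_start = page.index("<img")
img_end = page.index(">", img_start) + 1
s, e = grow_balanced(page, img_start, img_end)
print(page[s:e])  # <td><center><img src="ad.gif"></center></td>
```

The loop stops at <p> because nothing balanced surrounds the grown match any more, which is the stopping rule described above.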

Clever combinations of add, balanced, alternate, and encloser can make most pages look like they never had an ad.

Once your match is found it is either stripped or rewritten. Strip should be obvious (it removes the match from the page). Rewrite requires the matcher to be followed by as [block]. The match will be replaced by the text following the as keyword. No interpretation is done of the as part; it is inserted verbatim.
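
To make the difference between strip and rewrite concrete, here is a rough Python model. It is only a sketch: the functions strip and rewrite are illustrative names, a plain regex stands in for the matcher, and none of this is FilterProxy's own code. The replacement is passed through a function so that it is inserted verbatim, mirroring the "no interpretation" behaviour of the as part.

```python
import re

def strip(text, pattern):
    """Remove every match of pattern from text (models the 'strip' command)."""
    return re.sub(pattern, "", text)

def rewrite(text, pattern, replacement):
    """Replace every match with replacement, inserted verbatim.

    A function is given to re.sub so that backslashes and group
    references inside the replacement are not interpreted.
    """
    return re.sub(pattern, lambda _match: replacement, text)

page = "We sell blue chickens and more blue chickens."
print(strip(page, r"blue chickens"))
# We sell  and more .
print(rewrite(page, r"blue chickens", r"[removed \1]"))
# We sell [removed \1] and more [removed \1].
```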

How do names and 'ignore' work?

Each rule can be named, so that if a rule BADRULE destroys the layout of one page, you can create a site regex for it (on the FilterProxy main page) which will contain the rule ignore BADRULE. This will cause BADRULE to not be applied to sites matching that site regex. You don't have to name your rules if you don't want to. You can even name ignore rules, so that you can ignore your ignore rules. But that is probably a little silly. Rules and ignores are processed in alphabetical order, so if you want one rule or ignore to be processed before another, you can precede the name with a number (e.g. 1_MYRULE), or just name it something that comes before it alphabetically.
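
A small Python sketch of the ordering and ignore behaviour described above. It assumes "alphabetical" means plain string comparison (so digits sort before uppercase letters, which sort before lowercase letters); that detail is an assumption, and rules_to_apply is a hypothetical helper, not part of FilterProxy.

```python
def rules_to_apply(rule_names, ignored):
    """Return rule names in processing order, skipping ignored ones.

    Assumes plain string comparison for the alphabetical order.
    """
    return [name for name in sorted(rule_names) if name not in ignored]

names = ["TABLEADS", "ADS", "1_MYRULE", "BADRULE", "WEBBUGS"]
print(rules_to_apply(names, ignored={"BADRULE"}))
# ['1_MYRULE', 'ADS', 'TABLEADS', 'WEBBUGS']
```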

Hints

To remove an advertisement from a web page, use your browser's 'View Source' function. Find the HTML code for the ad, and write a rule to match that HTML. It is also useful to get the document using wget, lynx, curl, or another command-line tool, and look at it with a text editor, since the document your browser shows with 'View Source' may not be the same one sent by the server (usually due to the presence of javascript, or FilterProxy already doing some filtering!). Be aware, however, that many servers will send different files if they can detect the browser you're using. That is, if the server can detect that you're using lynx (via the User-Agent header), it will not bother sending the javascript ad... Thus netscape/mozilla and lynx/wget will receive different documents. Also note that when you hit 'View Source' in your browser, your browser may choose to reload the document, so the document that appears in the view-source window may not be the same as the one rendered.
Try to find the element in the ad that is most indicative of it being an ad, and use add to expand the match to include any javascript, tables, or forms. Do not, for instance, write a rule like:
```
strip tag <img width=468 height=60>
```
since many sites have 468x60 images that are not ads.
If you want to temporarily disable a rule, without deleting it, add an 'ignore' rule for its name.
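Many advertisers put comments in the HTML delimiting their advertisement, e.g. <!-- Auto Banner Insertion Begin --> .... <!-- Auto Banner Insertion Complete THANK YOU --> (the markers used by the COMDELADS rule). Write a 'regex' rule to match this, e.g.:
```
strip regex /<!-- Auto Banner Insertion Begin -->/ add regex /<!-- Auto Banner Insertion Complete THANK YOU -->/
```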
The tag matcher tries hard to obey document structure, and is usually faster than a straight regex. If you find yourself working hard to write complicated regex rules, consider using the tag matcher instead.
Try hard to make the rules you write work on as many sites as possible. Avoid writing rules for one specific page (unless you visit it often). You might find yourself writing rules for every page you visit!
If you're trying to match something, and can't figure out how to do it with the existing matchers and options, let me know!
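
The comment-delimiter hint above (a regex match grown forward to a second regex) can be modelled in a few lines of Python. This is only a sketch: strip_delimited is a hypothetical helper, the markers are borrowed from the COMDELADS rule, the sample page and its ads.example URL are invented, and the markers are treated as literal strings rather than regexes.

```python
import re

def strip_delimited(html, begin, end):
    """Remove everything from a begin marker through the next end marker."""
    pattern = re.compile(re.escape(begin) + r".*?" + re.escape(end), re.DOTALL)
    return pattern.sub("", html)

BEGIN = "<!-- Auto Banner Insertion Begin -->"
END = "<!-- Auto Banner Insertion Complete THANK YOU -->"

page = (
    "<p>news</p>\n"
    + BEGIN + "\n"
    + "<a href='http://ads.example/click'><img src='banner.gif'></a>\n"
    + END + "\n"
    + "<p>more news</p>\n"
)
cleaned = strip_delimited(page, BEGIN, END)
print(cleaned)
```

The non-greedy `.*?` stops at the first end marker after each begin marker, which matches the idea of growing forward until the second regex is found.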

Speed Considerations

Turn on "timing" on the main page and look at the log. Each Rewrite rule will generate a line like:
```
[25163 Mon Mar 12 14:48:27 2001]   Rewrite: ADS took 0.48920 seconds, 381 failed, 2 successful
```
FilterProxy should be roughly O(n) in the number of "failed" matches listed. FilterProxy is also roughly O(n) in the number of "successful" matches listed (but we don't care how long that takes, right?). With the tag matcher, a failed match is counted each time the tag name matches but the attributes do not. (Now you see why the ADS rule is so slow: it tries to find ads by looking at the <a> tag.)
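
If you want to compare rules, the timing lines can be pulled out of the log with a short script. The Python sketch below assumes every timing line has exactly the shape of the single sample shown above; the names RuleTiming and parse_timing are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class RuleTiming(NamedTuple):
    rule: str
    seconds: float
    failed: int
    successful: int

# Pattern derived from the one sample log line; other formats are not handled.
TIMING_RE = re.compile(
    r"Rewrite: (\S+) took ([\d.]+) seconds, (\d+) failed, (\d+) successful"
)

def parse_timing(line: str) -> Optional[RuleTiming]:
    """Extract rule name, time, and match counts from one log line."""
    m = TIMING_RE.search(line)
    if m is None:
        return None
    return RuleTiming(m.group(1), float(m.group(2)), int(m.group(3)), int(m.group(4)))

line = "[25163 Mon Mar 12 14:48:27 2001]   Rewrite: ADS took 0.48920 seconds, 381 failed, 2 successful"
timing = parse_timing(line)
print(timing.rule, timing.seconds, timing.failed, timing.successful)
# ADS 0.4892 381 2
```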
The first matcher should be relatively unique in the document. To a very good approximation, the amount of time a filtering rule will take is proportional to the number of times the first matcher matches (in other words, FilterProxy is O(n) in the number of times the first matcher matches). For example:
```
strip tag <a> containing attrib <a href=~/doubleclick.net/>
```
would be very slow (for most documents that contain lots of <a> tags), but:
```
strip tag <a href=~/doubleclick.net/>
```
would be much faster (by a factor of 3, by my tests). But an even faster way is:
```
strip tag <img src=~/doubleclick.net/> add tagblock <a>
```
Since most documents contain many <a> tags, both of the first two examples will be pretty slow. Assuming the document contains more <a> tags than <img> tags, the last example will be fastest.
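
To check which tag is the rarer starting point for a given page, you can simply count the start tags. The following Python sketch uses the standard html.parser module; TagCounter, count_tags, and the sample page are invented for illustration and have nothing to do with how FilterProxy itself parses documents.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts how often each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts

page = """
<a href="/home">Home</a> <a href="/news">News</a> <a href="/about">About</a>
<a href="http://doubleclick.net/click"><img src="http://doubleclick.net/ad.gif"></a>
"""
counts = count_tags(page)
print(counts["a"], counts["img"])  # 4 1
```

Here a rule starting from <img> has one candidate to examine, while a rule starting from <a> has four.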
Note that to see a speed improvement using the regex finder instead of the tag finder, the string matched by the regex matcher must be unique in the document, and an equivalent tag matcher would have very many false matches. (where the tag name matches, but the attributes of the tag do not)
Try to write as few rules as possible. If it takes n milliseconds to traverse an entire document, and you have m rules, then it will take n*m milliseconds to go through all of them (in other words, FilterProxy is linear in the number of rules). Now you see why the ADS rule is so ugly: if I split it up into several rules, it would be even slower.

Terse syntax description:

  The basic syntax is:
    [NAME:] command matcher [[qualifying predicate] [expanding predicate] ...]

    Note: [] means "optional", {} means "mandatory", ... means "more than one"
          <> are literal, and must be included as part of the rule.
  Commands:
    strip                         remove from file
    rewrite {matcher} as {html}   change matched text to something else
    ignore {NAME} [...]           ignore a named rule (can specify more than one)

  Expanding Predicates: (modifiers that expand an existing match)
    add {matcher}
          grow match to include text matched by [matcher] (use a matcher below)
          (can apply more than once, order matters)  if [matcher] is one of
          (tag, tagblock, regex, attrib), the match will grow forward from the
          previous point until it finds [matcher].

  Qualifying Predicates: (modifiers that also must match in order to consider it a match)
    inside {matcher}
          like encloser, except that the match that precedes it must be *inside*
          the match that follows it.  This does not change the original match
          (use add encloser instead if you want to strip the thing it's inside).

    containing {matcher}
          the matched block must contain {matcher}

  Matchers:  Each matcher "finds" a block of text that gets passed to the
      predicates that follow it.

    tag [options] <{tagname} [attrib[=value]] [...]>  
          Will grab all content enclosed by tag.  Any of tagname, attrib, or
          value can be a regular expression by enclosing them in one of these
          regex delimiters: [/#%&!,=:].

    tagblock <{tagname} [attrib[=value]] [...]>  
          Matches to the closing tag corresponding to the tag specified. (like
          old 'tag -tagblock')

    attrib <{tagname} attrib[=value] [attrib[=value]] [...]>     
          Will grab the attribute specified.  Note that you can specify more
          than one attribute, and the *first* one is the one that will be
          stripped/rewritten, but the tagname must match and other attribs are
          required to be present.

    regex /regex/
          Match any (perl) regex.  Regex must be delimited by one of:
          [/#%&!,=:].  Note that this does matching (m//), not s///, 
          tr/// or y///. (yet)

    encloser <{tagname} [attrib[=value]] [...]> 
          Like tagblock, except that the block must enclose the previous match.
          (only makes sense as argument to 'add', and should really be named 
          "enclosing tag block" but that's too long)

    balanced                                      
          Grow match to include "balanced" tags that have the tag preceding
          the match, and the corresponding closing tag trailing the match (with
          nothing in between).  Only makes sense as argument to 'add'.

    alternate                                     
          Grow the match to include "alternate content".  i.e. script/noscript,
          frame/noframe, layer/nolayer etc.  Only makes sense as argument to 
          'add'.

  In all cases more than one attrib can be specified.  You may chain as many
  matchers and predicates as you like, but if it starts to get too long it will
  probably be ambiguous and not do what you might expect.  (I need a BNF form
  grammar for this syntax...)
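
As a rough illustration of the basic syntax line above, this Python sketch splits a rule into its optional name, its command, and the rest. It is not the real parser: split_rule and RULE_HEAD are hypothetical, the name is assumed to be a single word followed by a colon and whitespace, and the matcher/predicate part is left unparsed.

```python
import re
from typing import Optional

RULE_HEAD = re.compile(
    r"^\s*(?:(?P<name>\w+):\s+)?"            # optional "NAME:" prefix
    r"(?P<command>strip|rewrite|ignore)\s+"  # one of the three commands
    r"(?P<rest>.+?)\s*$"                     # matcher and predicates, unparsed
)

def split_rule(line: str) -> Optional[dict]:
    """Split a rule line into name, command, and the remaining text."""
    m = RULE_HEAD.match(line)
    return m.groupdict() if m else None

print(split_rule("ADS: strip tag <img src>"))
print(split_rule("strip regex /blue chickens/"))
print(split_rule("ignore BADRULE"))
```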

Rewrite was written by Bob McElrath. Please see the README, BUGS, and any relevant module documentation before mailing me with problems.

Name	Action	Finder
ADS		tagblock </a|img|i?layer|i?frame|script|form/ /src|href|action/ =~ #(?:(?:ad(?:cafe|_|click|buy|count(?:er)?)?(?:serv|link|click|verts?|log|graphic|banner|source|mosaic|intelligent|\.cgi|\.pl)|blipverts?|/ad(?:s|-bin|buy)?/|banner(?:s?\.(?:cgi|phtml|php[0-9])|man|click)|/(?:event|html)\.ng/|servfu\.pl|/ads?[\._]|(?:images?_?|click_.x\.|adstream_.x\.)ads?|sponsor|phpAds|_ad\.html|click-through|www.amazon.com/exec/obidos/redirect-home|prohosting.com/click)|http://(?:remote)?ad(?:s|s?[0-9]+|server|images?|redir(?:ect)?|_click)?\.|(?:(?:link4link|linkexchange|flycast|clicktrade|doubleclick|avenuea|blockstackers|mediaplex|focalink|valueclick|onresponse|imgis|admaximize|eads|datis|commission-junction|pennyweb|linkbuddies|preferences|dimeclicks|futurenet|247media|netadsrv\.iworld|clk4|spanishbanner|hibtox|burstnet|swiftad|adzerver|hamster|ukbanners|spunkmedia|namezero|ecoupons|spinbox|adclub|advertising|hitbox|link4ads|bfast|gameadexchange|adbureau|linksynergy|superstats|st21\.yahoo|phoenix-adrunner\.mycomputer|counter\.xoom|\w+ads\.osdn|euniverseads)\.(?:com|net)|(?:freecity)\.de))#> add encloser </(?:no)?script/> add balanced add alternate add balanced
COMDELADS		regex /<!-- Auto Banner Insertion Begin -->/ add regex /<!-- Auto Banner Insertion Complete THANK YOU -->/
DUNNO		tagblock <ilayer name=userban visibility=hide> add regex </layer> add alternate add balanced
SCRIPTADS		regex /(ads\.freecity\.de|flycast\.com)/ add encloser <script> add alternate add balanced
SCRIPTS2		tagblock <script src=~/banner/> add balanced add alternate add balanced
TABLEADS		tagblock </a/ href=~/(clk4|hamster)\.com/> add encloser <table> add balanced add alternate
WEBBUGS		tag <img /width/=1 /height/=1> add encloser </(no)?script/> add alternate as <spacer width=1 height=1>