A "Matcher" (currently either tag, attrib, or regex) is applied to the file to find the content desired. So a Matcher like
tag <img src>
regex /blue chickens/
tag </(a|img)/ /(src|href)/> add tagblock <script>
In the last rule, add expands the match to include a <script> block that follows the initial <a href> or <img src>. You can use the encloser matcher instead of tagblock to cause it to grow to a <script> block that encloses the initial <a href> or <img src>.
add alternate adds "alternate content". In other words, if you match a <script> block, its alternate content is a <noscript> block. A <noscript> block is usually used to show banner ads to browsers which don't support javascript, or don't have it turned on. Often it's easy to match the ad inside <noscript> but almost impossible to match a javascript ad. Since these are often right next to each other in a page, alternate will consider them one block. alternate also knows about <layer> and <ilayer> and their alternate content <nolayer>, and <frame> and its alternate content <noframe>.
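The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not FilterProxy's Perl implementation: the function name `grow_alternate` and the `ALTERNATES` table are hypothetical, and the sketch assumes the alternate block directly follows the matched block with nothing but whitespace in between.

```python
import re

# Tag -> alternate-content tag, as listed in the text above (hypothetical table).
ALTERNATES = {"script": "noscript", "layer": "nolayer",
              "ilayer": "nolayer", "frame": "noframe"}

def grow_alternate(html, start, end):
    """Extend the match html[start:end] over an alternate block that follows it."""
    opening = re.match(r"<\s*(\w+)", html[start:end])
    if not opening:
        return start, end
    alt = ALTERNATES.get(opening.group(1).lower())
    if alt is None:
        return start, end
    # Assumption: only whitespace separates the block from its alternate content.
    pattern = re.compile(r"\s*<%s\b[^>]*>.*?</%s\s*>" % (alt, alt), re.I | re.S)
    follow = pattern.match(html, end)
    if follow:
        end = follow.end()
    return start, end

page = 'x<script src="ad.js"></script>\n<noscript><img src="ad.gif"></noscript>y'
s = page.index("<script")
e = page.index("</script>") + len("</script>")
s, e = grow_alternate(page, s, e)
print(page[s:e])
```

The sketch only grows forward from the matched block; growing backward from a matched <noscript> to the <script> before it would need the mirror-image search.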
add balanced adds "balanced enclosers". In other words, if you match <img src=...> and it has a <center> preceding it and a </center> trailing it, balanced will consider the center tags part of the match. It continues adding balanced enclosers until it reaches a leading tag that does not have a corresponding closing tag trailing the match. balanced ignores whitespace, comments, and a few other tags like <br> and <p>.
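As a rough illustration of the loop described above, here is a Python sketch (not FilterProxy's actual Perl code). The names `grow_balanced`, `_skip_back` and `_skip_fwd` are hypothetical, and the set of ignored tags is assumed to be just <br> and <p>, since the text does not list the others.

```python
import re

# Ignorable items per the text: whitespace, comments, and (assumed) <br> and <p> tags.
_IGNORE_TAIL = re.compile(
    r"(?:\s+|<!--(?:(?!-->).)*-->|</?(?:br|p)\b[^<>]*>)\Z", re.I | re.S)
_IGNORE_HEAD = re.compile(
    r"(?:\s+|<!--.*?-->|</?(?:br|p)\b[^<>]*>)+", re.I | re.S)
_OPEN_TAIL = re.compile(r"<([a-z][a-z0-9]*)\b[^<>]*>\Z", re.I)

def _skip_back(html, pos):
    """Move pos left over ignorable items that end exactly at pos."""
    while True:
        m = _IGNORE_TAIL.search(html, 0, pos)
        if not m:
            return pos
        pos = m.start()

def _skip_fwd(html, pos):
    """Move pos right over ignorable items that start exactly at pos."""
    m = _IGNORE_HEAD.match(html, pos)
    return m.end() if m else pos

def grow_balanced(html, start, end):
    """Grow html[start:end] while an opening tag precedes it and its closing tag follows."""
    while True:
        opener = _OPEN_TAIL.search(html, 0, _skip_back(html, start))
        if not opener:
            return start, end
        closer = re.compile(r"</%s\s*>" % re.escape(opener.group(1)), re.I)
        closing = closer.match(html, _skip_fwd(html, end))
        if not closing:
            return start, end  # leading tag has no matching closing tag: stop
        start, end = opener.start(), closing.end()

page = '<div><center> <!-- ad --><img src="ad.gif"><br></center><p>text</div>'
s, e = grow_balanced(page, page.index("<img"), page.index("<br>"))
print(page[s:e])
```

In this example the <center> pair is absorbed, but growth stops at <div> because the text "text" sits between the match and </div>.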
Clever combinations of add, balanced, alternate, and encloser can make most pages look as if they never had an ad.
Once your match is found it is either stripped or rewritten. Strip should be obvious (it removes the match from the page). Rewrite requires the matcher to be followed by as [block]. The match will be replaced by the text following the as keyword. No interpretation is done of the as part; it is simply inserted verbatim.
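To make "no interpretation" concrete, here is a small Python sketch of the two actions (the function names `strip_match` and `rewrite_verbatim` are hypothetical, and a plain regex stands in for a FilterProxy matcher). Passing a function as the replacement keeps Python's `re.sub` from interpreting backslashes or group references in the replacement text.

```python
import re

def strip_match(html, pattern):
    """Remove every match of pattern from the page."""
    return re.sub(pattern, "", html)

def rewrite_verbatim(html, pattern, replacement):
    """Replace every match with replacement, inserted exactly as written."""
    # A callable replacement is used as-is; a plain string would have
    # sequences such as \1 interpreted by re.sub.
    return re.sub(pattern, lambda match: replacement, html)

page = 'a<img src="ad.gif">b'
print(strip_match(page, r"<img[^>]*>"))
print(rewrite_verbatim(page, r"<img[^>]*>", r"<!-- \1 was here -->"))
```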
Each rule can be named, so that if a rule BADRULE destroys the layout of one page, you can create a site regex for it (on the FilterProxy main page) which will contain the rule ignore BADRULE. This will cause BADRULE to not be applied to sites matching that site regex. You don't have to name your rules if you don't want to. You can even name ignore rules, so that you can ignore your ignore rules. But that is probably a little silly. Rules and ignores are processed in alphabetical order, so if you want one rule or ignore to be processed before another, you can precede the name with a number (e.g. 1_MYRULE), or just name it something that comes before it alphabetically.
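The ordering and ignore behaviour can be pictured with a short Python sketch. It assumes "alphabetical order" means a plain string sort (so digits sort before letters), which the text implies but does not state; the function `rules_to_apply` and the rule bodies are made up for illustration.

```python
def rules_to_apply(rules, ignored):
    """Return rule names in processing order, skipping ignored ones.

    rules:   dict mapping rule name -> rule text
    ignored: set of names listed in 'ignore' rules for the matching site
    """
    # Assumption: alphabetical order is an ordinary string sort.
    return [name for name in sorted(rules) if name not in ignored]

rules = {
    "MYRULE": "strip tag <img src=~/doubleclick.net/>",
    "BADRULE": "strip tag <img width=468 height=60>",
    "1_MYRULE": "strip regex /blue chickens/",
}
print(rules_to_apply(rules, {"BADRULE"}))
```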
Use add to expand the match to include any javascript, tables, or forms. Do not, for instance, write a rule like:
strip tag <img width=468 height=60>
since many sites have 468x60 images that are not ads.
strip regex /<!-- +Begin Ad +-->/ add regex /<!-- +End Ad +-->/
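Read together with the add semantics in the syntax reference (the match grows forward until it finds the added matcher), the rule above removes everything from a "Begin Ad" comment through the next "End Ad" comment. A Python sketch of that reading follows; `strip_between` is a hypothetical name, and leaving the page untouched when no end marker is found is an assumption.

```python
import re

def strip_between(html, begin_re, end_re):
    """Remove each span from a begin_re match through the next end_re match."""
    begin, end = re.compile(begin_re), re.compile(end_re)
    kept, pos = [], 0
    while True:
        b = begin.search(html, pos)
        if not b:
            break
        e = end.search(html, b.end())  # grow forward from the initial match
        if not e:
            break  # assumption: without an end marker, nothing is stripped
        kept.append(html[pos:b.start()])
        pos = e.end()
    kept.append(html[pos:])
    return "".join(kept)

page = 'top<!--  Begin Ad --><img src="ad.gif"><!-- End Ad  -->bottom'
print(strip_between(page, r"<!-- +Begin Ad +-->", r"<!-- +End Ad +-->"))
```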
[25163 Mon Mar 12 14:48:27 2001] Rewrite: ADS took 0.48920 seconds, 381 failed, 2 successful
FilterProxy's processing time should be roughly O(n) in the number of "failed" matches listed. It is also roughly O(n) in the number of "successful" matches listed (but we don't care how long that takes, right?). A failed match is counted each time the tag name matched but the attributes did not, when using the tag matcher. (Now you see why the ADS rule is so slow: it tries to find ads by looking at the <a> tag.)
strip tag <a> containing attrib <a href=~/doubleclick.net/>
would be very slow (for most documents that contain lots of <a> tags), but:
strip tag <a href=~/doubleclick.net/>
would be much faster (by a factor of 3, by my tests). But an even faster way is:
strip tag <img src=~/doubleclick.net/> add tagblock <a>
Since most documents contain many <a> tags, both of the first two examples will be pretty slow. Assuming the document contains more <a> tags than <img> tags, the last example will be fastest.
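To see why the choice of tag matters, one can count how many "failed" matches (tag name matches, attributes do not) each variant would produce on a given page. The Python sketch below is only an estimate using crude regex parsing, not FilterProxy's own tag finder; `count_tag_matches` is a hypothetical helper.

```python
import re

def count_tag_matches(html, tagname, attr, value_re):
    """Return (failed, successful) counts for a tag matcher on this page."""
    failed = successful = 0
    for tag in re.finditer(r"<%s\b([^>]*)>" % re.escape(tagname), html, re.I):
        # Crude attribute lookup: attr=value with optional quotes.
        found = re.search(r'\b%s\s*=\s*["\']?([^"\'\s>]*)' % re.escape(attr),
                          tag.group(1), re.I)
        if found and re.search(value_re, found.group(1)):
            successful += 1
        else:
            failed += 1  # tag name matched, attributes did not
    return failed, successful

page = ('<a href="/1">x</a><a href="/2">y</a>'
        '<a href="http://ad.doubleclick.net/c">'
        '<img src="http://ad.doubleclick.net/i.gif"></a>'
        '<a href="/3">z</a>')
print(count_tag_matches(page, "a", "href", r"doubleclick\.net"))
print(count_tag_matches(page, "img", "src", r"doubleclick\.net"))
```

On this sample page the <a> variant has three failed matches and the <img> variant has none, which is the effect the paragraph above describes.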
regex finder instead of the tag finder, the string matched by the regex matcher must be unique in the document, and an equivalent tag matcher would have very many false matches (where the tag name matches, but the attributes of the tag do not).
The basic syntax is:

    [NAME:] command matcher [[qualifying predicate] [expanding predicate] ...]

Note: [] means "optional", {} means "mandatory", ... means "more than one".
<> are literal, and must be included as part of the rule.

Commands:
    strip
        remove from file
    rewrite {matcher} as {html}
        change matched text to something else
    ignore {NAME} [...]
        ignore a named rule (can specify more than one)

Expanding Predicates: (modifiers that expand an existing match)
    add {matcher}
        grow match to include text matched by [matcher] (use a matcher below)
        (can apply more than once, order matters)
        If [matcher] is one of (tag, tagblock, regex, attrib), the match will grow forward from the previous point until it finds [matcher].

Qualifying Predicates: (modifiers that also must match in order to consider it a match)
    inside {matcher}
        like encloser, except that the match that precedes it must be *inside* the match that follows it. This does not change the original match (use add encloser instead if you want to strip the thing it's inside).
    containing {matcher}
        the matched block must contain {matcher}

Matchers:
    Each matcher "finds" a block of text that gets passed to the predicates that follow it.

    tag [options] <{tagname} [attrib[=value]] [...]>
        Will grab all content enclosed by tag. Any of tagname, attrib, or value can be a regular expression by enclosing them in one of these regex delimiters: [/#%&!,=:].
    tagblock <{tagname} [attrib[=value]] [...]>
        Matches to the closing tag corresponding to the tag specified. (like old 'tag -tagblock')
    attrib <{tagname} attrib[=value] [attrib[=value]] [...]>
        Will grab the attribute specified. Note that you can specify more than one attribute, and the *first* one is the one that will be stripped/rewritten, but the tagname must match and other attribs are required to be present.
    regex /regex/
        Match any (perl) regex. Regex must be delimited by one of: [/#%&!,=:]. Note that this does matching (m//), not s///, tr/// or y///. (yet)
    encloser <{tagname} [attrib[=value]] [...]>
        Like tagblock, except that the block must enclose the previous match. (only makes sense as argument to 'add', and should really be named "enclosing tag block" but that's too long)
    balanced
        Grow match to include "balanced" tags that have the tag preceding the match, and the corresponding closing tag trailing the match (with nothing in between). Only makes sense as argument to 'add'.
    alternate
        Grow the match to include "alternate content", i.e. script/noscript, frame/noframe, layer/nolayer etc. Only makes sense as argument to 'add'.

In all cases more than one attrib can be specified. You may chain as many matchers and predicates as you like, but if it starts to get too long it will probably be ambiguous and not do what you might expect. (I need a BNF form grammar for this syntax...)