FilterProxy

The latest version is 0.30 released January 13, 2002 12:03am CDT: ChangeLog
Download it: in tar.gz format or rpm format.
Required perl packages can be found on CPAN, as RPMs at your favorite Red Hat mirror (in the /powertools/CPAN directory), or as debs using apt-get! (See the INSTALL file.)
Check the README
Mail the Author: Bob McElrath <bob+filterproxy@mcelrath.org>.
SourceForge project page

0.30 is here! (finally)

[Sat Jan 12, 2002] After a long wait, FilterProxy 0.30 is finally here. This version switches from Parse::ePerl to HTML::Mason. If you tried to install FilterProxy before and were unable to because of issues with Parse::ePerl or perl 5.6, you should have no trouble with this version. Other exciting changes in this version include a view-source-like feature that marks up the pieces of the document that were filtered. Since this is a little difficult to explain, it's best to just see it. With the included javascript bookmarks, you can now see how a page was filtered with one click, and also edit the configuration for that page. An XSLT module has been contributed by Mario Lang. XSLT lets you transform XML/HTML by examining the file's structure and writing an XSL stylesheet. For more info on XSLT, take a look at this XSLT tutorial.

What is FilterProxy?

FilterProxy is a generic http proxy with the capability to modify proxied content on the fly. It has a modular system of filters which can modify web pages. The modular system means that many filters can be applied in succession to a web page, and configuration is easy and flexible. FilterProxy can proxy any data served by the HTTP protocol (i.e. anything off the web), and filter any recognizable mime-type. All configuration is done via web-based forms, or by editing a configuration file. It was created to fix some of the annoyances of poor web design by rewriting the offending pages. It can also improve the web for you, in both speed (Compress) and quality (Rewrite/XSLT). After ads (and their graphics) are stripped out, and html is compressed, surfing over a modem is much faster. It is comparable in purpose and functionality to Muffin (a similar project in java) and WebCleaner (a similar project in python). FilterProxy is written in perl, and is quite fast.
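To make "many filters applied in succession" concrete, here is a minimal sketch in Python. It is not FilterProxy's code (FilterProxy is written in perl), and the names strip_blink, compress and apply_filters are invented for this illustration. It also assumes that stripping a tag keeps the text between the opening and closing tags.

```python
import gzip
import re
from typing import Callable, List

# A filter takes a response body and returns a (possibly modified) body.
Filter = Callable[[bytes], bytes]


def strip_blink(body: bytes) -> bytes:
    """Remove <blink> and </blink> tags, keeping the text between them."""
    return re.sub(rb"</?blink\b[^>]*>", b"", body, flags=re.IGNORECASE)


def compress(body: bytes) -> bytes:
    """Gzip the body; a real proxy would also set Content-Encoding: gzip."""
    return gzip.compress(body)


def apply_filters(body: bytes, filters: List[Filter]) -> bytes:
    """Run each filter in order, feeding the output of one into the next."""
    for f in filters:
        body = f(body)
    return body


page = b"<html><body><blink>Buy now!</blink> Welcome.</body></html>"
cleaned = apply_filters(page, [strip_blink])
sent = apply_filters(page, [strip_blink, compress])
```

The order matters: rewriting has to happen before compression, since a compressed body is no longer text that a rewriting filter can match against.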

(NEW!) Also check out my list of ways to fix web/Netscape annoyances that don't involve filtering. (currently small fonts, blink, and javascript popup windows)

Ok, ok, now what the hell does it really do?

Modules that are currently written are:

Rewrite
Allows web pages to be rewritten in arbitrary ways. This means that advertisements can be removed from really complex pages. It also will let you reformat the layout. It will allow you to remove tags, modify the attributes to tags, or remove or change entire sections. Practically, this means it can remove that damn <blink> tag:
strip tag <blink>
change <font size=1> to something larger and more readable:
rewrite attrib <font size=1> as size="-1"
(Ideally, remove all absolute font sizes, and replace them with relative ones. Why do so many web pages do this to me?) It also removes web bugs:
rewrite tag <img /width/=1 /height/=1> add encloser </(no)?script/> add alternate as <spacer width=1 height=1>
which are usually 1x1 gifs that advertisers use to track you (and which slow down your browsing severely over a modem!). For a good description of web bugs, check out this Washington Post article. (It will say it can't find the article... just hit reload and it will show up.) Most importantly, it can remove ads (even javascript ones!):
strip regex #(ads\.freecity\.de|flycast\.com|/RealMedia/ads/)# inside tagblock <script> add alternate add balanced
These are rewrite rules, and just a hint of the power with which you can rewrite web pages you visit.
XSLT(NEW!)
XSLT stands for Extensible Stylesheet Language Transformations, and it transforms one XML document into another XML document. With the XSLT module you can apply XSL transformations to HTML. Here is a tutorial on XSLT. Basically, you can rewrite HTML documents by examining the structure of the document. But XSLT does not have the matching power of regular expressions, so it is complementary to the Rewrite module.
Compress
Compresses web pages. This can lead to a 5x speed improvement if you are surfing over a modem and can arrange to have FilterProxy running on a server with a direct net connection.
Header
Filters HTTP headers in arbitrary ways. This means it can anonymize your requests (removing User-Agent, Referer), and filter cookies by domain, e.g. by not accepting cookies from, or sending cookies to, any known advertiser's domain. It can remove any header (including regexp matching of header names), and add arbitrary headers.
De-Anim
De-animates animated gifs, and removes other "extension blocks", generally making them smaller as well as de-animating them.
Skeleton
Example module (heavily commented) for people interested in extending FilterProxy by writing new modules.
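
As a rough picture of what the Header module described above does, here is a small Python sketch of removing headers by regexp on the header name and adding arbitrary ones. It is only an illustration: filter_headers and the sample header values are made up here, and FilterProxy's own configuration and perl implementation will look different.

```python
import re
from typing import Dict, Iterable, Optional


def filter_headers(
    headers: Dict[str, str],
    remove_patterns: Iterable[str],
    add: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Drop headers whose whole name matches any pattern, then add new ones."""
    compiled = [re.compile(p, re.IGNORECASE) for p in remove_patterns]
    kept = {
        name: value
        for name, value in headers.items()
        if not any(c.fullmatch(name) for c in compiled)
    }
    kept.update(add or {})
    return kept


# Hypothetical outgoing request headers.
request = {
    "Host": "www.example.com",
    "User-Agent": "Mozilla/4.7",
    "Referer": "http://www.example.com/previous.html",
    "X-Tracking-Id": "abc123",
}

anonymized = filter_headers(
    request,
    [r"User-Agent", r"Referer", r"X-Track.*"],
    add={"X-Filtered": "yes"},
)
```

Header names are matched case-insensitively because HTTP header names are case-insensitive.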

Modules that may be coming, as the author has time (or volunteers help!)

FilterCookie
Keep a "cookie jar". Rather than Netscape keeping all your cookies, FilterCookie will take care of it instead. It will also allow you to easily view your cookie jar and remove cookies from it. It will be able to filter cookies on a site-by-site basis.
Mapper
Map URLs to other URLs. This might be useful to get "printer-friendly" versions of articles from various news sites, and to block requests for images from known advertisers' domains (in case they slip through Rewrite).
Mirror
Cache a local copy of images from sites you visit often. Possibly rewrite img references to be local (i.e. http://... -> file://...)
ProfileAgent
  1. Acts as a Netscape roaming profile server, so that it stores bookmarks and netscape preferences.
  2. Allows web-based access to a user's bookmarks (so friends can see them).
  3. Has a "search engine" which indexes content on bookmarked pages (and pages linked to from a bookmarked page, on the same server), so that you can find things in your bookmarks by a search through this module's interface.
  4. Allows "classification" of bookmarks in a more sophisticated manner (preferably by keyword, rather than tree), and then can generate yahoo-like indexes by keyword. (or by searches for a keyword).

For instance, I might bookmark the homepage for xmms (http://www.xmms.org/), which I would then classify by adding the keywords (mp3, linux, audio, music, eyecandy, earcandy, X11, software). Then when I do a search for "software" using this module's interface, I get all items which have the keyword, including xmms. If I search for "linux software", I get all things with these keywords, etc. You get the idea. (You could make a yahoo-like index, or filesystem-like path, by joining keywords: "/linux/software/mp3". Note this is the same as "/software/linux/mp3".) (Does anyone else but me have thousands of bookmarks, and occasionally think "I saw a piece of software that does X", and then spend 2 hours manually searching your bookmarks?)

For bonus points, add a web spider that will search documents linked from the bookmarked page, and add them to the search engine's database. (This way you could find info by searching that you've never seen, but is closely related to something you've bookmarked).

For bonus bonus points, add the capability for the spider to use Netscape's "What's Related" (or similar) interface to find things similar to the page bookmarked, and index them too.

For bonus bonus bonus points, make sure this doesn't get exploited by advertisers.

This could be an entire thesis project on software agents. Any takers?

Ok, enough rambling, how do I use it?

Well, first download it. It requires perl, and several modules from CPAN (See the INSTALL file).

After getting it running, tell your browser to use the proxy. Under netscape, select the menu item Edit->Preferences. Then, in the preferences dialog box, select Advanced->Proxies. (You may have to click the little arrow next to Advanced to get netscape to expand the menu.) Then select "Manual proxy configuration", and put in the "HTTP Proxy" field the host and port on which you ran FilterProxy. If you haven't edited FilterProxy.pl, this should be 'localhost' and '8888'.
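
Programs other than a browser can be pointed at the proxy in the same way. As a sketch, this is how a Python script could route its HTTP requests through FilterProxy using the standard urllib module, assuming the default 'localhost' and '8888' mentioned above. The example URL is a placeholder.

```python
import urllib.request

# Host and port where FilterProxy is assumed to be listening (the defaults).
PROXY_URL = "http://localhost:8888"

# Route plain http:// requests through the proxy.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY_URL})
opener = urllib.request.build_opener(proxy_handler)

# With FilterProxy running, a fetch through it would look like this:
# html = opener.open("http://www.example.com/").read()
```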

So now what?

What do I do if something goes wrong?


FilterProxy and this page are © Copyright Bob McElrath. Last modified Friday July 20 22:07:00 CDT 2001