Squirm - A redirector for Squid
Squirm is a fast & configurable redirector for the Squid Internet Object Cache. It requires the GNU Regex Library (now included in the Squirm source), and of course, a working Squid. It is available free under the terms of the GNU GPL.
Squirm has the following features:
- full GNU regex pattern matching and replacement, including pattern buffers
- case sensitive (regex) and case insensitive (regexi) rules
- accelerator strings and abort extensions to keep pattern searches fast
- a squirm.local file to control which client addresses get redirected
- an interactive mode for safely testing configuration changes
- configuration reloads via a HUP signal, without restarting Squid
I started writing it because the existing redirector scripts used too much memory and all were too slow for Squids that receive a lot of requests.
On my Pentium Pro 200 running Linux, it manages to do 16,440 lines per second (that's 59 million lines per hour!) using my squirm.local and squirm.patterns config files.
It can handle nifty things like file mirrors with the regex pattern replacement strings, and do site blocking - useful for schools. It could also do things like banner ad rewriting, and just about anything else :-)
The latest version is squirm-1.0betaB which you can download as a normal tar file or a gzipped tar file.
The most recent version is always available from this page at http://squirm.foote.com.au/
cd regex
./configure
make clean
make
cp -p regex.o regex.h ..
This step is a bit ugly - I welcome anyone who has experience with the configure script to incorporate this directly into Squirm - Anyone?
orbit:/usr/local/src/squirm-1.0betaB# whoami
root
orbit:/usr/local/src/squirm-1.0betaB# /usr/local/squirm/bin/squirm
Squirm running as UID 0: writing logs to stderr
Wed Mar 11 13:20:37 1998:unable to open local addresses file [/usr/local/squirm/etc/squirm.local]
Wed Mar 11 13:20:37 1998:unable to open redirect patterns file
Wed Mar 11 13:20:37 1998:Invalid condition - continuing in DODO mode
Wed Mar 11 13:20:37 1998:Squirm (PID 29760) started
[Ctrl + C]
redirect_program /usr/local/squirm/bin/squirm
redirect_children 10
The number of children you need depends on the load on your Squid box. Try 10 and use the cachemgr.cgi CGI to see if all redirector processes get used; if they do, you can raise this number.
By default, the two config files are located at /usr/local/squirm/etc/squirm.local and /usr/local/squirm/etc/squirm.patterns. You need to create these two files from scratch with the aid of the following instructions:
127.0.0
10.2.3
192.168.1
These are used to determine if Squirm should rewrite a URL. You wouldn't normally want any Squid neighbours to be able to use your redirector as the extra load of ICP requests would bog down your machine, so don't include them in the file.
For the above config file, requests to the Squid from 10.2.3.4 would be accepted, whilst requests from 1.2.3.4 would be ignored.
There is currently no plan to implement CIDR notation because Squirm uses simple integer comparisons to make lookups really quick.
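For example, if your clients all sit on the 192.168.5 network and you also test from the local host, your squirm.local would presumably contain just:

127.0.0
192.168.5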
The syntax of lines in the squirm.patterns file is of the form:
regex|regexi pattern replacement [[^]accelerator_string[$]]
or
abort .filename_extension
Full regex matching and replacement is made available by the use of the GNU Regex library. It also supports pattern buffers.
Let's say you want to redirect requests to a local URL for a common file, where it's matched case sensitively:
regex ^.*/n32e301\.exe$ http://www.mydomain/path_to/n32e301.exe
This means: replace URLs ending in /n32e301.exe with the URL of your local copy.
To do the same as above except case insensitively, you would use regexi instead of regex at the start of the line.
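For example, the same rule matched case insensitively would be:

regexi ^.*/n32e301\.exe$ http://www.mydomain/path_to/n32e301.exe

so that a request for /N32E301.EXE would also be caught.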
The accelerator string is used to avoid regex comparisons of URLs unless they are close to the pattern expected. Squirm first compares a URL against the accelerator string before it bothers to do a proper regex comparison, which saves many CPU cycles on a busy machine. Note: you should always use accelerator strings if possible on a busy box!
For the above example, a speedup is achieved through the use of the accelerator string n32e301.exe$, so the line would look like:
regex ^.*/n32e301\.exe$ http://www.mydomain/path_to/n32e301.exe n32e301.exe$
The accelerator string can have a leading caret '^' OR a trailing dollar '$' to indicate that the rough match should search at the start or end of the URL respectively.
The reason behind the use of the abort extension is the massive speedup gained by aborting pattern searches for URLs that end in a certain filename extension. (Why traverse the entire patterns list and do comparisons when they won't be matched anyway?)
Let's say we don't need to traverse the list for files ending in .gif. The line needed is:
abort .gif
regexi ^http://tucows\.[^/]*/(.*$) http://tucows.mymirror.com/\1 ^http://tucows.
abort .gif
abort .html
abort .jpg
abort .htm
regex .*/c16e401\.jar$ http://redirector1.senet.com.au/c16e401.jar c16e401.jar$
regexi .*/c32e401\.jar$ http://redirector1.senet.com.au/c32e401.jar c32e401.jar$
regex .*/cb16e401\.exe$ http://redirector1.senet.com.au/cb16e401.exe cb16e401.exe$
regex .*/cb32e401\.exe$ http://redirector1.senet.com.au/cb32e401.exe cb32e401.exe$
regex .*/cc16e401\.exe$ http://redirector1.senet.com.au/cc16e401.exe cc16e401.exe$
regex .*/cc32e401\.exe$ http://redirector1.senet.com.au/cc32e401.exe cc32e401.exe$
The first line contains an accelerator string ^http://tucows. so Squirm has to do the regex comparison only if the URL matches it. Because this is the first line in the squirm.patterns file, much time is saved by not having to do a regex comparison for every single URL. (Accelerator strings are not compulsory on a config line, but the speed improvement is quite large.)
The first regex comparison uses a case insensitive pattern which matches HTTP URLs for any hostname beginning with tucows. It stores the path information in a pattern buffer which is later replayed in the URL replacement by using \1 (up to 10 replays are possible).
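As a purely hypothetical illustration (the hostnames here are made up), a pattern using two buffers might look like:

regexi ^http://www\.oldhost\.com/([^/]*)/(.*$) http://www.newhost.com/\1/\2 ^http://www.oldhost.com

where \1 replays the first path component and \2 replays the rest of the path.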
The abort extensions are used so that, once a URL's filename extension matches an abort line, no comparisons against the remaining lines are made at all. It is wise to use the abort extension for the most frequently requested filename extensions, but not for filename extensions that occur only rarely. .gif, .jpg, .html and .htm are good candidates for the abort extension.
You may wish to have a way of blocking access to sites which contain material unsuitable for viewing by children, and to return a web page which lets them know they have requested a site which is blocked.
regexi ^http://www\.playboy\.com/.* http://www/notallowed.html
regexi ^http://www\.xxx\.com/.* http://www/notallowed.html
This will return the URL http://www/notallowed.html to anyone requesting URLs starting with http://www.playboy.com or http://www.xxx.com
For long lists of sites to block, the use of accelerator strings may help, in which case the above example would be:
regexi ^http://www\.playboy\.com/ http://www/notallowed.html ^http://www.playboy.com
regexi ^http://www\.xxx\.com/ http://www/notallowed.html ^http://www.xxx.com
If you would like to include the blocked URL in the resulting page (something like "The URL http://www.playboy.com/file.jpg has been blocked"), you could create a CGI which takes the URL as an argument, and add the request to the pattern replacement.
regexi ^(http://www\.playboy\.com/.*) http://www/cgi-bin/na?url=\1
This might be a good approach when you already have a list of hostnames to add to the patterns file, for example:
cat list-of-banned-sites \
| sed -e "s/\./\\\./g" \
| awk '{ print "regexi ^(http://" $1 "/.*) http://www/cgi-bin/na?url=\1" }' \
>> /usr/local/squirm/etc/squirm.patterns
Again, adding accelerator strings to long lists may help with speed.
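For completeness, here is a minimal sketch of what such a CGI (the hypothetical na script used above) could look like, assuming it is written as a plain /bin/sh CGI and that the blocked URL arrives in the query string as url=...:

#!/bin/sh
# Hypothetical "na" CGI: tells the user which URL was blocked.
# Assumes a query string of the form url=http://www.playboy.com/file.jpg
echo "Content-type: text/html"
echo ""
echo "<html><body><h1>Access denied</h1>"
echo "<p>The URL ${QUERY_STRING#url=} has been blocked.</p>"
echo "</body></html>"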
When Squirm is run as root, it goes into interactive mode which echoes all information that would normally be logged to standard error output. This gives the opportunity to test a configuration file modification before restarting the current squirm processes on the machine.
Optionally, you can supply the path of a squirm patterns config file, if it's not in the default location, as the first argument.
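For example (this path is just for illustration), to test a copy of your patterns file kept elsewhere:

/usr/local/squirm/bin/squirm /tmp/squirm.patterns.test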
Squid sends requests to the standard input of a redirector process with the form:
URL src_address/hostname ident method
The ident field is usually a dash '-'. The hostname is normally a dash too, since Squid is normally configured not to look up hostnames for proxy requests. For Squirm to do any redirection, the method must be GET and the src_address must match an address from the squirm.local file.
The following text is an example of running Squirm interactively (the URL request lines are typed in as input):
frog:~# whoami
root
frog:~# /usr/local/squirm/bin/squirm
Squirm running as UID 0: writing logs to stderr
Tue Mar 10 22:00:34 1998:Loading IP List
Tue Mar 10 22:00:34 1998:Reading Patterns from config /usr/local/squirm/etc/squirm.patterns
Tue Mar 10 22:00:34 1998:Squirm (PID 16955) started
http://tucows.com/downloads/win95/n32e301p.exe 127.0.0.1/- - GET
http://tucows.senet.com.au/downloads/win95/n32e301p.exe 127.0.0.1/- - GET
Tue Mar 10 22:00:57 1998:http://tucows.com/downloads/win95/n32e301p.exe:http://tucows.senet.com.au/downloads/win95/n32e301p.exe
http://www.somewhere.com/path/file 127.0.0.1/- - GET
http://www.somewhere.com/path/file 127.0.0.1/- - GET
[Ctrl + D]
You can also feed test requests to Squirm from a file by redirecting standard input:
/usr/local/squirm/bin/squirm < filename
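For example (the filename and contents here are just an illustration), a file called test-urls containing request lines in the format shown above:

http://tucows.com/downloads/win95/n32e301p.exe 127.0.0.1/- - GET
http://www.somewhere.com/path/file 127.0.0.1/- - GET

could be fed to Squirm with:

/usr/local/squirm/bin/squirm < test-urls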
There are several log files in /usr/local/squirm/logs; they are normally only viewable by the squid user id and root.
When you have modified either squirm.local or squirm.patterns, all of the running squirm processes need to be restarted by sending them a HUP signal.
(Restarting Squid by sending it a HUP signal will also do this, but that usually isn't convenient because it makes Squid unavailable for a period of time.)
Under Linux, you can do this by typing:
killall -HUP squirm
On other systems you may have to write a small script:
#!/bin/sh
for PID in `ps -aux | grep redirector | grep -v grep | awk '{ print $2 }'`
do
  kill -HUP $PID
done
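Note that this script greps for processes named redirector; if your redirector binary is installed as /usr/local/squirm/bin/squirm, as in the squid.conf example above, you would presumably grep for squirm instead:

ps -aux | grep squirm | grep -v grep | awk '{ print $2 }'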
Maintained by Chris Foote, chris@foote.com.au
Copyright (C) 1998 Chris Foote & Wayne Piekarski
If you find it useful, I'd like to know - please send email to chris@foote.com.au - Ta!
Includes the GNU Regex library written by many authors - see regex/AUTHORS for details.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Please see the file GPL in the source directory for full copyright information.
File Last Modified: Aug 21 2005