SWISH-SEARCH - Swish-e Searching Instructions
This page describes the process of searching with Swish-e. Please see the SWISH-CONFIG page for information the Swish-e configuration file directives, and SWISH-RUN for a complete list of command line arguments.
Searching a Swish-e index involves passing command line arguments to it that specify the index file to use, and the query (or search words) to locate in the index. Swish-e returns a list of file names (or URLs) that contain the matched search words. Perl is often used as a front-end to Swish-e such as in CGI applications, and perl modules exist to for interfacing with Swish-e.
The -w
command line argument is used specify the search query to Swish-e.
swish-e -w airplane
will find all documents that contain the word airplane.
When running Swish-e from a shell prompt, be careful to protect your query from shell metacharacters and shell expansions. This often means placing single or double quotes around your query. See "Searching with Perl" if you plan to use Perl as a front end to Swish-e. In the examples below single quotes are used to protect the search from the shell.
The following section describes various aspects of searching with Swish-e.
You can use the Boolean operators and, or, near or not in searching. Without these Boolean operators Swish-e will assume you're and'ing the words together. The operators are not case sensitive. These three searches are the same:
swish-e -w foo bar
swish-e -w bar foo
swish-e -w foo AND bar
[Note: you can change the default to oring by changing the variable DEFAULT_RULE in the config.h file and recompiling Swish-e.]
The not operator inverts the results of a search.
swish-e -w not foo
finds all the documents that do not contain the word foo.
Parentheses can be used to group searches.
swish-e -w 'not (foo and bar)'
The result is all documents that have none or one term, but not both.
To search for the words and, or, near or not, place them in a double quotes. Remember to protect the quotes from the shell:
swish-e -w '"not"'
swish-e -w \"not\"
will search for the word "not".
Other examples:
swish-e -w smilla or snow
Retrieves files containing either the words "smilla" or "snow".
swish-e -w smilla snow not sense
swish-e -w '(smilla and snow) and not sense' (same thing)
retrieves first the files that contain both the words "smilla" and "snow"; then among those the ones that do not contain the word "sense".
The near keyword is similar to and but implies a proximity between the words. The near keyword takes a integer argument as well, indicating the maximum distance between two words to consider a valid match.
Example:
swish-e -w smilla near5 snow
would match the document if the words smilla
and snow
appeared within 5 positions of one another.
A near search with no argument or argument of 0 is the same as an and search.
Two different wildcard characters are available, each evoking different behaviour.
The *
means "match zero or more characters."
The ?
means "match exactly one character."
The wildcard *
may only be used at the end of a word. Otherwise *
is considered a normal character (i.e. can be searched for if included in the WordCharacters directive).
Example:
swish-e -w librarian
this query only retrieves files which contain the given word.
On the other hand:
swish-e -w 'librarian*'
retrieves "librarians", "librarianship", etc. along with "librarian".
Note that wildcard searches combined with word stemming can lead to unexpected results. If stemming is enabled, a search term with a wildcard will be stemmed internally before searching. So searching for running*
will actually be a search for run*
, so running*
would find runway
. Also, searching for runn*
will not find running
as you might expect, since running
stems to run
in the index, and thus runn*
will not find run
.
The ?
wildcard matches exactly one character, but may not be used at the start of a word.
Example:
swish-e -w 's?ow'
will match snow
, slow
and show
but not strow
.
This:
swish-e -w '?how'
will throw an error.
In general, the order of evaluation is not important. Internally swish-e processes the search terms from left to right. Parenthesis can be used to group searches together, effectively changing the order of evaluation. For example these three are the same:
swish-e -w foo not bar baz
swish-e -w not bar foo baz
swish-e -w baz foo not bar
but these two are not the same:
swish-e -w foo not bar baz
swish-e -w foo not (bar baz)
The first finds all documents that contain both foo and baz, but do not contain bar. The second finds all that contain foo, and contain either bar or baz, but not both.
It is often helpful in understanding searches to use the boolean terms and parenthesis. So the above two become:
swish-e -w foo AND (not bar) AND baz
swish-e -w foo AND (not (bar AND baz))
These four examples are all the same search (assuming that AND is the default search type):
swish-e -w 'juliet not ophelia and pac'
swish-e -w '(juliet) AND (NOT ophelia) AND (pac)'
swish-e -w 'juliet not ophelia pac'
swish-e -w 'pac and juliet and not ophelia'
Looking at the the first three searches, first Swish-e finds all the documents with "juliet". Then it finds all documents that do not contain "ophelia". Those two lists are then combined with the boolean AND operator resulting with a list of documents that include "juliet" but not "ophelia". Finally, that list is ANDed with the list of documents that contain "pac" resulting.
However it is always possible to force the order of evaluation by using parenthesis. For example:
swish-e -w 'juliet not (ophelia and pac)'
retrieves files with "juliet" that do not contain both words "ophelia" and "pac".
MetaNames are used to represent fields (called columns in a database) and provide a way to search in only parts of a document. See SWISH-CONFIG for a description of MetaNames, and how they are specified in the source document.
To limit a search to words found in a meta tag you prefix the keywords with the name of the meta tag, followed by the equal sign:
metaname = word
metaname = (this or that)
metaname = ( (this or that) or "this phrase" )
It is not necessary to have spaces at either side of the "=", consequently the following are equivalent:
swish-e -w "metaName=word"
swish-e -w "metaName = word"
swish-e -w "metaName= word"
To search on a word that contains a "=", precede the "=" with a "\" (backslash).
swish-e -w "test\=3 = x\=4 or y\=5"
this query returns the files where the word "x=4" is associated with the metaName "test=3" or that contains the word "y=5" not associated with any metaName.
Queries can be also constructed using any of the usual search features, moreover metaName and plain search can be mixed in a single query.
swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)"
This query will retrieve all the files in which "a1" or "a2" are found in the META tag "metaName1" and that do not contain the words "a3" and "a7", where "a3" and "a7" are not associated to any meta name.
To search for a phrase in a document use double-quotes to delimit your search terms. (The phrase delimiter is set in src/swish.h.)
You must protect the quotes from the shell.
For example, under Unix:
swish-e -w '"this is a phrase" or (this and that)'
swish-e -w 'meta1=("this is a phrase") or (this and that)'
Or under Windows:
swish-e -w \"this is a phrase\" or (this and that)
You can not use boolean search terms inside a phrase. That is:
swish-e -w 'this and that'
finds documents with both words "this" and "that", but:
swish-e -w '"this and that"'
finds documents that have the phrase "that and that". A phrase can consist of a single word, so this is how to search for the words used as boolean operators:
swish-e -w 'this "and" that'
finds documents that contain all three words, but in any order.
You can use the -P
switch to set the phrase delimiter character. See SWISH-RUN for examples.
At times you might not want to search for a word in every part of your files since you know that the word(s) are present in a particular tag. The ability to search according to context greatly increases the chances that your hits will be relevant, and Swish-e provides a mechanism to do just that.
The -t option in the search command line allows you to search for words that exist only in specific HTML tags. Each character in the string you specify in the argument to this option represents a different tag in which the word is searched; that is you can use any combinations of the following characters:
H means all<HEAD> tags
B stands for <BODY> tags
t is all <TITLE> tags
h is <H1> to <H6> (header) tags
e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
c is HTML comment tags (<!-- ... -->)
# This search will look for files with these two words in their titles only.
swish-e -w "apples oranges" -t t
# This search will look for files with these words in comments only.
swish-e -w "keywords draft release" -t c
This search will look for words in titles, headers, and emphasized tags.
swish-e -w "world wide web" -t the
Perl ( http://www.perl.com/ ) is probably the most common programming language used with Swish-e, especially in CGI interfaces. Perl makes searching and parsing results with Swish-e easy, but if not done properly can leave your server vulnerable to attacks.
When designing your CGI scripts you should carefully screen user input, and include features such as paged results and a timer to limit time required for a search to complete. These are to protect your web site against a denial of service (DoS) attack.
Included with every distribution of Perl is a document called perlsec -- Perl Security. Please take time to read and understand that document before writing CGI scripts in perl.
Type at your shell/command prompt:
perldoc perlsec
If nothing else, start every CGI program in perl as such:
#!/usr/local/bin/perl -wT
use strict;
That alone won't make your script secure, but may help you find insecure code.
There are many examples of CGI scripts on the Internet. Many are poorly written and insecure. A commonly seen way to execute Swish-e from a perl CGI script is with a piped open. For example, it is common to see this type of open()
:
open(SWISH, "$swish -w $query -f $index|");
This open()
gives shell access to the entire Internet! Often an attempt is made to strip $query
of bad characters. But, this often fails since it's hard to guess what every bad character is. Would you have thought about a null? A better approach is to only allow in known safe characters.
Even if you can be sure that any user supplied data is safe, this piped open still passes the command parameters through the shell. If nothing else, it's just an extra unnecessary step to running Swish-e.
Therefore, the recommended approach is to fork and exec swish-e
directly without passing through the shell. This process is described in the perl man page perlipc
under the appropriate heading Safe Pipe Opens.
Type:
perldoc perlipc
If all this sounds complicated you may wish to use a Perl module that does all the hard work for you.
The Swish-e distribution includes a Perl module called SWISH::API. SWISH::API provides access to the Swish-e C Library.
The SWISH::API module is not installed by default.
The SWISH::API module will embed Swish-e into your perl program so that searching does not require running an external program. Embedding the Swish-e program into your perl program results in faster Swish-e searches, especially when running under a persistent environment like mod_perl since it avoids the cost of opening the index file for every request (mod_perl is much also much faster than CGI because it avoids the need to compile Perl code for every request).
See the README file in the perl directory of the Swish-e distribution for installation instructions. Documentation for the SWISH::API module is available at http://swish-e.org and is installed along with other HTML documentation on your computer.
$Id$
.