NAME

CHANGES - List of revisions

OVERVIEW

This document contains a list of bug fixes and feature additions to Swish-e.

Version 2.4.8 - xxx

example/templates/search.tt fix for escaped slashes

Per http://swish-e.org/archive/2008-06/12180.html

index format changed to use all 64-bit sizes

A global change was made to normalize index sizes for 64 bits.

This requires reindexing if you are upgrading from a previous version.

fixed integer overflow issues

Issues included:

 * http://dev.swish-e.org/ticket/14

added -D prop_delimiter option

The new -D option at indexing time lets you specify an alternate char delimiter when property values are appended to one another at indexing time. The default value is the same as it always has been: a single space.

To match the libswish3 behavior, pass the ETX (\x03) ASCII control character like this:

 swish-e -D '\x03'

It's up to your search code to process the property values accordingly.
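
As a minimal sketch of what that processing might look like, assuming the search code receives a property as one string whose values were joined with ETX at indexing time (the function name split_property is hypothetical and not part of Swish-e):

```python
ETX = "\x03"  # the delimiter given to swish-e with -D at indexing time


def split_property(raw, delimiter=ETX):
    """Split a concatenated property string back into its individual values."""
    if raw == "":
        return []
    return raw.split(delimiter)


# Values that themselves contain spaces survive intact with an ETX delimiter.
print(split_property("red\x03dark green\x03blue"))
```

With the default single-space delimiter the value "dark green" could not be told apart from two separate values, which is the main reason to pick a control character.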

yanked SWISH-Stemmer-0.05.tar.gz

The tar file of SWISH::Stemmer in the example dir has been broken for years, but since the module itself is severely deprecated we hadn't bothered with it until now. It was removed because it was preventing the good people of the Fedora project from building their RPMs.

Version 2.4.7 - 4 April 2009

Added ReturnRawRank for raw rank score

Setting ReturnRawRank to a true value will return the rank score unscaled. Can be set with the -a command line option (mnemonic: "a"bsolute rank score).

Yanked setenv feature introduced in 2.4.6

The ranking debugging feature using setenv introduced in 2.4.6 was yanked. Some platforms (notably HP-UX and Windows) lack the setenv feature, and the convenience of setting the env var was not worth the limitations.

Version 2.4.6 - 10 March 2008

MinWordLength respected in query parser

Clark Vent reported that the query parser was not respecting MinWordLength settings. See http://dev.swish-e.org/changeset/2145

Patch to file.c.

The file.c patch was in response to http://swish-e.org/archive/2007-03/11321.html although that user never responded about that patch.

SWISH_DEBUG_RANK env var now enables rank debugging

Set SWISH_DEBUG_RANK to a true value to enable lots of rank debugging on stderr.

Perl Makefile.PL patched to fix MakeMaker issue

Recent versions of ExtUtils::MakeMaker revealed a bug in Makefile.PL. Patch from mschwern via RT, report by mpeters.

LARGEFILE support detected automatically in configure

jrobinson852@yahoo.com suggested that LARGEFILE support be auto-detected since it is needed so often on Linux systems.

New Snowball stemmers

Trygve Falch contributed patches to update the Snowball stemmers, including new Hungarian and Romanian stemmers.

Patched leaks

Antony Dovgal patched two leaks. First, when a file failed to open, the file name was not freed.

Second, SwishSetSearchLimit() was nulling the search limits when an error was found in the parameters, but not freeing the existing limits.

Leak in SwishResetSearchLimit

Fixed a leak if a limit was set and then reset but not prepared. Patch provided by Antony Dovgal.

New API functions added

Added SwishGetStructure() and SwishGetPhraseDelimiter() functions which return relevant properties of the search object. Patch provided by Antony Dovgal.

Version 2.4.5 - 22 Jan 2007

Fixed 'deflate' handling in spider.pl

spider.pl was using the wrong method to uncompress HTTP responses that were 'deflate' encoded. It also now decodes content based on the document's charset and encodes it back to that charset before outputting.

re-indexing required

The magic numbers in src/swish.h were changed to require re-indexing from version 2.4.4 indexes. This should have been done in 2.4.4 as well, and anytime the index format changes. -- karman

fixed stemmer bug introduced in 2.4.4

stemmer.c had a mix up in the deprecated stemmer assignments for "Stemmer_en" and "Stem". Also fixed stemmer.h so that 2.4.3 indexes can be read correctly. -- karman

Now fork/exec to run filters

FileFilter* was using popen to run the filter, which could pass user data through the shell. It now uses fork/exec if fork is available, which should be everywhere except Windows. On Windows popen is still used, but all parameters are double-quoted. -- moseley
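
The difference can be illustrated with a short sketch. It is not the Swish-e code: it uses Python's subprocess module, which starts a program directly (no shell) when given an argument list, and a stand-in "filter" that just echoes its first argument. The names run_filter and echo_filter are hypothetical.

```python
import subprocess
import sys


def run_filter(program_args, filename):
    """Run a filter with the filename as one argv entry; no shell is involved."""
    result = subprocess.run(
        program_args + [filename],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")


# Stand-in filter: writes its first argument back to stdout unchanged.
echo_filter = [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])"]

# A file name containing shell metacharacters is passed through as plain data.
hostile = "report.pdf; echo injected"
print(run_filter(echo_filter, hostile))
```

Had the same string been interpolated into a shell command line, the part after the semicolon would have been run as a second command.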

fixed signed/unsigned warnings from gcc 4.x

Cleaned up search.c to catch mismatched signedness warnings from newer GCC versions. This issue pre-existed 2.4.4 but the new wildcard features in search.c made for a lot more warnings. -- karman

Makefile.mingw included in distrib

Modified root Makefile to include the perl/Makefile.mingw file. -- karman

Version 2.4.4 - 11 Oct 2006

Version 2.4.4 RC1

Release Candidate 1 for 2.4.4, 2 Oct 2006.

quote fix for FileFilter config param

Ludovic Drolez contributed a patch to fix a quoting issue with filenames. This affects non-Windows builds only.

SWISH::Filter now on CPAN

SWISH::Filter is now available on http://cpan.org/. The version in the distribution is not kept in sync with the CPAN version. Install the CPAN version if you want the latest and greatest version.

SWISH::API updated to 0.04

Added several fixes, including:

 * Perlish method names from mpeters@plusthree.com
 * switched to XSLoader with DynaLoader as fallback
 * added VERSION method to satisfy some versions of MakeMaker
 * Fuzzify() method now actually works as advertised

added proximity feature and single character wildcard with '?' instead of '*'

Herman Knoops contributed these patches. See http://swish-e.org/archive/2006-05/10543.html

Error messages were also changed to better reflect correct use of wildcards.

fixed bug when using DoubleMetaphone

Fixed problem reported by Andreas Völter where a query that generated a two-word query with DoubleMetaphone fuzzy mode was not working.

fix sparc64 property issue

Sorithy Seng (pourlassi@gmail.com) submitted a patch against docprop.c to fix an issue on sparc64 platforms. It is unknown whether this bug affected other 64-bit architectures.

fixed bug when StopWords resulted in no unique words

Added check in db_native.c to check that some words exist before writing index.

updates to SWISH-RUN.1

Added doc for -u and -r options.

filename only in SWISH::Filters

Added a fix to SWISH::Filters::pp2html and SWISH::Filters::XLtoHTML to save only the filename as the title, without the full path.

Removed Stem and Stemmer_en

The legacy Porter stemmer was removed. This had been deprecated some time ago. A warning will be issued if the old stemmer is specified in the config file, and Stemmer_en1 will be used instead.

GPL'd all the source files with the new Swish-e License

After a source code review, the developers decided to put Swish-e under the GPL with a special exception for linking against libswish-e. See http://swish-e.org/license.html for the details.

Fixed Segfault with updating incremental index

Dobrica Pavlinusic reported a segfault after updating an index multiple times. José provided updated worddata.c. - April 27, 2005

Fixed NOT check with incremental indexes

Swish was returning results for deleted files when the NOT operator was used.

Fixed bug when using old parsers with zero length input

Thomas Angst reported swish consuming memory when using -S prog to process a large number of empty documents.

When -S prog generated a zero length file the old parsers (e.g. TXT) would attempt to read in *all* content from the -S prog program into a buffer. The old parser incorrectly assumed it was reading from a filter and tried to read to eof().

Changes to ParserWarnLevel

The default value for ParserWarnLevel was changed from zero to two.

The ParserWarnLevel controls the error handling of the libxml2 parser. The higher the setting, the more verbose the output. The change to the default is to report when libxml2 has problems parsing a document (which often times results in processing only part of a document).

To get the old behavior, either set ParserWarnLevel to zero in your config file, or use the new -W command line option to set the ParserWarnLevel at run time. If ParserWarnLevel is set in the config file, it will override the -W option.

Also, to see UTF-8 to 8859-1 conversion errors set ParserWarnLevel to 3 or more. Previously, these warnings were issued at a ParserWarnLevel of one.

Documentation changes

Removed all the target documentation (html, pdf, ps) from cvs. There's now a separate cvs module "swish_website" that is used to generate both the website and the html docs. If building swish-e from cvs please see the README.cvs file for instructions.

Fixed bug in pre-sorted indexes with USE_BTREE

Gunnar Mätzler reported a problem with reading the pre-sorted property index tables when running with USE_BTREE (--enable-incremental). Not all entries were being written to disk. There was/is a question whether the "array" code used for pre-sorted indexes with USE_BTREE would be slower. So, a separate define USE_PRESORT_ARRAY was added to enable that code when USE_BTREE is set. This allows using the old integer arrays with USE_BTREE. Gunnar reported that this is working, but more testing is needed. We still need to compare the speed of the array code vs. the non-array code, and to verify the workings of the USE_PRESORT_ARRAY code.

Add strcoll() usage for sorting properties

Andreas Seltenreich provided a patch to use strcoll when sorting properties. strcoll is locale dependent.
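
To illustrate what locale-dependent collation means in practice, here is a small sketch using Python's binding to the same C library function (locale.strcoll). The helper name sort_properties is hypothetical, and the "C" locale is used only because it is available everywhere; other locales may order the same strings differently.

```python
import functools
import locale


def sort_properties(values, locale_name="C"):
    """Sort strings using the collation rules of the given locale."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(values, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore it afterwards


# In the "C" locale, comparison follows plain character codes.
print(sort_properties(["banana", "Apple", "apple"]))
```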

Fix incremental indexing when adding back a file

Jose fixed a problem with incremental indexing where a file could not be added back to the index once removed.

Patch initially provided by Dobrica Pavlinusic:

    http://swish-e.org/Discussion/archive/2004-12/8694.html

Documentation correction

A change in the default way the index is compressed was not documented in 2.4.3. The change resulted in larger indexes. See CompressPositions below and in SWISH-CONFIG.

libxml2 UTF-8 conversion failures

Fixed issue where a UTF-8 to Latin1 encoding failure would skip more input than just the failed character. Libxml2 passes swish text that is not null terminated, but the libxml2 functions to skip UTF-8 chars expected a null-terminated string. Replace libxml2 call with fixed version.
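
The idea behind a length-bounded skip can be sketched as follows. This is an illustration only, not the replacement function used in Swish-e; the names utf8_char_length and skip_one_char are hypothetical. The point is that the skip is limited by an explicit buffer length instead of relying on a terminating null.

```python
def utf8_char_length(lead_byte):
    """Return the sequence length implied by a UTF-8 lead byte (1 if invalid)."""
    if lead_byte < 0x80:
        return 1
    if 0xC0 <= lead_byte < 0xE0:
        return 2
    if 0xE0 <= lead_byte < 0xF0:
        return 3
    if 0xF0 <= lead_byte < 0xF8:
        return 4
    return 1  # stray continuation or invalid byte: skip just this byte


def skip_one_char(buf, pos, length):
    """Advance past one character, never moving beyond `length` bytes of `buf`."""
    step = utf8_char_length(buf[pos])
    return min(pos + step, length)


data = "a\u20acb".encode("utf-8")  # 'a', a three-byte euro sign, 'b'
print(skip_one_char(data, 1, len(data)))
```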

Version 2.4.3 December 9, 2004

New config directive: CompressPositions

This option enables zlib compression for word data in the index. Previously word data was always compressed but resulted in slower wildcard searches. The default now is to not compress the word data, but results in larger index files. Set to "YES" to get pre-2.4.3 index sizes.

[This CHANGES entry was added after 2.4.3 was released]

Improved error messages when using incremental indexing

There was a bit of confusion on how to use incremental indexing (still experimental) so added better logic for error messages.

Also fixed a logic error when setting the incremental update mode. Caught by Paul Loner.

Version 2.4.3-pr1 - Wed Dec 1 09:52:50 PST 2004

"Fixed" libxml2's change in UTF8Toisolat1() return value

Bernhard Weisshuhn supplied a patch to parser.c for checking the return value of UTF8Toisolat1(). Seems that libxml2 now returns the number of characters converted instead of zero for success.

   http://bugzilla.gnome.org/show_bug.cgi?id=153937

Added swish-config and pkg-config

Swish now provides a swish-config script and config file for the pkg-config utility. These tools help when building programs that link with the swish-e library.

The SWISH::API Makefile.PL program uses swish-config to locate the installation directory of swish-e. This should make building SWISH::API easier when swish-e is installed in a non-standard location.

Fixed rank bias in merge

Peter van Dijk noticed that MetaNamesRank settings were not being copied to the output index when merging.

Added SwishFuzzy function

SwishFuzzy function (SWISH::API::Fuzzy) lets you stem a word without first searching. This might be helpful for playing with queries prior to the search.

Fixed translate character table

Michael Levy found an error in the table used to translate 8859-1 to ascii7. Luckily, it was an upper case translation and the table is only used on lower case characters.

MetaNamesRank documentation

Changed the 'not yet implemented' caveat to 'implemented but experimental'.

Added Continuation option to config processing

You can now use continuation lines in the config file:

    IgnoreWords \
        the \
        am \
        is \
        are \
        was

There may not be any characters following the backslash.
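
The rule can be sketched as a small line joiner. This is an illustration of the behavior described above, not the Swish-e config parser; the function name join_continuations is hypothetical.

```python
def join_continuations(text):
    """Join lines that end in a backslash with the line that follows them."""
    logical, pending = [], ""
    for line in text.splitlines():
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash, keep accumulating
        else:
            logical.append(pending + line)
            pending = ""
    if pending:
        logical.append(pending)  # file ended on a continuation line
    return logical


config = "IgnoreWords \\\n    the \\\n    am\nIndexFile out.index\n"
for entry in join_continuations(config):
    print(entry.split())
```

A backslash followed by a space is not at the end of the line, so in this sketch it does not continue the directive.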

Fixed Buzzwords (and other word lists entered in the config)

Words entered in config were not converted to lower case before storing in the index.

Fixed metaname mapping problem in Merge

Peter Karman found an error when merging indexes where the source indexes had the same metanames, but listed in a different order in their config files. Words would then be indexed under the wrong metaID number in the output index.
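
A sketch of the underlying idea, mapping metaIDs by name rather than by position. The data structures and the function name build_id_map are hypothetical and far simpler than the real index headers.

```python
def build_id_map(source_metanames, output_metanames):
    """Map each source metaID to the output metaID that has the same name."""
    output_ids = {name: meta_id for meta_id, name in output_metanames.items()}
    return {src_id: output_ids[name] for src_id, name in source_metanames.items()}


# Two source indexes declare the same metanames in a different order.
index_a = {1: "title", 2: "author"}
index_b = {1: "author", 2: "title"}
output = {1: "title", 2: "author"}

print(build_id_map(index_a, output))
print(build_id_map(index_b, output))
```

Copying index_b's IDs unchanged would file its "author" words under "title" in the output index, which is the symptom described above.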

SWISH::Filters and spider.pl updates

The web spider spider.pl was updated to work better with SWISH::Filter by default and also make it easier to use the spider default along with a spider config file. See spider.pl for details.

SWISH::Filter was updated. The way filters are created has changed. If you created your own filters you will need to update them. Take a look at SWISH::Filter and the filters included in the distribution.

Updates to Documentation

Richard Morin submitted formatting and punctuation updates to the README and INSTALL docs.

Added -R option to support IDF word weighting in ranking. (karman)

Added Inverse Document Frequency calculation to the getrank() routine. This will allow the relative frequency of a word in relationship to other words in the query to impact the ranking of documents.

Example: if 'foo' is present twice as often as 'bar' in the collection as a whole, a search for 'foo bar' will weight documents with 'bar' more heavily (i.e., higher rank) than those with 'foo'.

The impact is greatest when OR'ing words in a query rather than AND'ing them (which is the default).
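
The 'foo'/'bar' example can be made concrete with the textbook IDF formula, log(N / df). This is only a sketch: the exact calculation inside getrank() is not given here, and the document counts below are made up for illustration.

```python
import math


def idf(total_docs, docs_with_word):
    """Textbook inverse document frequency: rarer words score higher."""
    return math.log(total_docs / docs_with_word)


total = 1000  # hypothetical collection size
weights = {
    "foo": idf(total, 200),  # 'foo' appears in twice as many documents
    "bar": idf(total, 100),
}
print(weights)
```

Because 'bar' is the rarer word, it receives the larger weight, so documents matching 'bar' rank higher than those matching only 'foo'.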

Also added Rank discussion to the FAQ.

Updates to the example scripts

Updated PhraseHighlight.pm as suggested by Bill Schell for an optimization when all words in a document are highlighted.

Updated search.cgi and PhraseHighlight.pm to use the internal stemmers via the SWISH::API module as suggested by Jonas Wolf.

Leak when using C library

David Windmueller found a memory leak when calling multiple searches on a swish handle. The problem was swish loading the pre-sorted property index on every search, even after the table had been loaded into memory.

Swish.cgi now kills swish-e on time out

The example script swish.cgi uses an alarm (on platforms that support alarm) to abort processing after some number of seconds, but it was not killing the child process, swish-e. Bill Schell submitted a patch to kill the child when the alarm triggers.
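
The same pattern, sketched in Python rather than Perl and using a timeout on the pipe read instead of an alarm signal. The function name run_with_timeout is hypothetical; the child here is a stand-in process that sleeps, not swish-e.

```python
import subprocess
import sys


def run_with_timeout(args, seconds):
    """Run a child process; kill it and return None if it exceeds the limit."""
    child = subprocess.Popen(args, stdout=subprocess.PIPE)
    try:
        output, _ = child.communicate(timeout=seconds)
        return output
    except subprocess.TimeoutExpired:
        child.kill()         # without this the child would keep running
        child.communicate()  # reap the killed child
        return None


slow = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(slow, 0.5))
```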

The template search.tt was renamed to swish.tt

The template was renamed because it's used by swish.cgi, not by search.cgi, which was confusing.

Updates to the search.cgi

The example script search.cgi was updated to work better with mod_perl and to use external template files and style sheets.

New MS Word Filter

James Job provided the SWISH::Filter::Doc2html filter that uses the wvWare (http://wvware.sourceforge.net/) program for filtering MS Word documents. If both catdoc and wvWare are installed then wvWare will be used.

wvWare is reported to do a good job at converting MS Word docs to HTML. In a few tests it did work well, but other cases it failed to generate correct output. It was also much, much slower than catdoc. I tested with wvWare 0.7.3 on Debian Linux. Testing with both is recommended.

John-Marc Chandonia pointed out that if a symlink is skipped by FileRules, then the actual file/directory is marked as "already seen" and cannot be indexed by other links or directly.

Now, files and directories are not marked "already seen" until after passing FileRules (i.e., after a file is actually indexed or a directory is processed).
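
A minimal sketch of the corrected ordering, assuming each path comes with an identifier for the underlying file (for example a device/inode pair). The function index_paths and the rule callback are hypothetical.

```python
def index_paths(paths, passes_file_rules, seen=None):
    """Return the paths to index; `paths` holds (path, file_id) pairs."""
    seen = set() if seen is None else seen
    indexed = []
    for path, file_id in paths:
        if file_id in seen:
            continue              # same underlying file already handled
        if not passes_file_rules(path):
            continue              # skipped, but NOT marked as seen
        seen.add(file_id)         # mark only once the path is accepted
        indexed.append(path)
    return indexed


# A symlink and its target share one file_id; the rule rejects the symlink.
paths = [("link-to-docs", 42), ("docs", 42)]
print(index_paths(paths, lambda p: not p.startswith("link-")))
```

Marking the file as seen before the rule check would have caused "docs" to be skipped as well.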

Could not set SwishSetSort() more than once

David Windmueller found a problem when trying to set the sort order more than once on an existing search object. Memory was not correctly reset after clearing the previous sort values.

Access MetaNames and PropertyNames from API

Patch provided by Jamie Herre to access the MetaNames and PropertyNames via the C API and to test via the testlib program. SWISH::API also updated to access this data.

SwishResultPropertyULong() bug fixed

David Windmueller reported that SwishResultPropertyULong() was returning ULONG_MAX on all calls. This was fixed.

Null written to wrong location in file.c

Bill Schell with the help of valgrind found a null written past the end of a buffer in file.c in the code that supports the old parsers. This resulted in a segfault while indexing a large set of XML documents.

Fixed problem when indexing very large files

Steve Harris reported a problem when indexing a very large document that caused an integer overflow. José Ruiz updated the code to use unsigned integers.
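
For illustration, here is how a value just past the signed 32-bit range behaves when stored in signed versus unsigned 32-bit integers. The width and the value are assumptions chosen for the example; the entry above does not say which variable overflowed.

```python
import ctypes


def as_signed32(value):
    """Interpret the low 32 bits of `value` as a signed integer."""
    return ctypes.c_int32(value).value


def as_unsigned32(value):
    """Interpret the low 32 bits of `value` as an unsigned integer."""
    return ctypes.c_uint32(value).value


offset = 2**31 + 5  # hypothetical size just past the signed 32-bit maximum
print(as_signed32(offset), as_unsigned32(offset))
```

The signed interpretation wraps to a negative number, while the unsigned one still holds the value; unsigned storage doubles the usable range but does not remove the limit.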

Bump word position on block tags with HTML2 parser

Peter Karman pointed out that the libxml2 HTML parser was allowing phrase matches across block level HTML elements. Swish now bumps the word position on these elements.
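
Why bumping the position prevents these matches can be shown with a toy model in which a phrase matches only when its words sit at consecutive positions. The tag list, the bump size, and both function names are assumptions for illustration, not the values used by Swish-e.

```python
BLOCK_TAGS = {"p", "div", "li", "h1"}  # illustrative subset only


def assign_positions(tokens, bump=1):
    """tokens: ("word", text) or ("tag", name). Returns [(word, position)]."""
    position = 0
    out = []
    for kind, value in tokens:
        if kind == "tag":
            if value in BLOCK_TAGS:
                position += bump  # leave a gap at block boundaries
            continue
        position += 1
        out.append((value, position))
    return out


def phrase_match(positions, first, second):
    """True if `second` occurs at the position right after `first`."""
    where = {}
    for word, pos in positions:
        where.setdefault(word, set()).add(pos)
    return any(p + 1 in where.get(second, ()) for p in where.get(first, ()))


tokens = [("tag", "p"), ("word", "hello"), ("tag", "p"), ("word", "world")]
print(phrase_match(assign_positions(tokens, bump=0), "hello", "world"))
print(phrase_match(assign_positions(tokens, bump=1), "hello", "world"))
```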

Version 2.4.2 - March 09, 2004

Version 2.4.1 - December 17, 2003

Version 2.4.0 - October 27, 2003

Version 2.4.0 (Release Candidate 4) September 26, 2003

Version 2.4.0 (Release Candidate 3) September 11, 2003

Version 2.4.0 (Release Candidate 2) September 10, 2003

Version 2.4.0 (Release Candidate 1) May 21, 2003

Version 2.2.3 - December 11, 2002

Multiple -L options were ORing instead of ANDing. Catch by Patrick Mouret. [moseley]
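
The intended semantics can be sketched as follows: a record must satisfy every limit, not just one. The record layout, the property names, and the function name matches_all_limits are hypothetical.

```python
def matches_all_limits(record, limits):
    """True only if every (property, low, high) limit is satisfied."""
    return all(low <= record[prop] <= high for prop, low, high in limits)


records = [
    {"name": "a.html", "size": 200, "year": 2002},
    {"name": "b.html", "size": 900, "year": 2002},  # fails the size limit
    {"name": "c.html", "size": 300, "year": 1999},  # fails the year limit
]
limits = [("size", 100, 500), ("year", 2001, 2002)]

print([r["name"] for r in records if matches_all_limits(r, limits)])
```

Under the old ORing behavior all three records would have been returned, since each one satisfies at least one of the limits.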

Version 2.2.2 - November 14, 2002

Pass non- text/* files onto indexing code IF there is a FileFilter associated with the *extension* of the URL. Fixes the problem of not being able to index, say, pdf files by using the FileFilter configuration option.

Fixed bug where nulls were stripped when using FileFilter with -S prog. Catch by Greg Fenton. [moseley]

Version 2.2.1 - September 26, 2002

Version 2.2 - September 18, 2002

Version 2.2rc1 - August 29, 2002

Many large changes were made internally in the code, some for performance reasons, some for feature changes and additions, and some to prepare for new features in later versions of Swish-e.

Changes to Configuration File Directives. Please see SWISH-CONFIG for more info.

Changes to command line arguments. See SWISH-RUN for documentation on these switches.
