Build Your Own Search Engine With ht://dig
Finding things is the #1 problem in this here computerized age. It's easier than ever to squirrel away terabytes of data. But then what? How to sort through all that lot to ever find anything? And why should we have to? In the olden days, file clerks, librarians, and secretaries took care of data storage and retrieval. Then along came computers, and suddenly it was decreed that file clerks, librarians, and secretaries were no longer necessary; that mere mortals like managers and sysadmins and programmers and other ordinary, unassuming humble personages could simply order their computers to do the work.
Fast-forward to the present. You have heroically managed to bring organization and sanity to the great masses of data under your care. All that remains is to construct a user-friendly search engine for the company Web sites. Maybe a public site or two, perchance an internal site or three. Look no farther than the excellent, customizable, easy-to-administer ht://dig.
ht://dig is a suite of several programs:
Once ht://dig is installed, edit /etc/htdig/htdig.conf first. Most of the file is self-explanatory. Be sure to customize the start_url: directive. You may designate a single site, or a space-delimited list of sites. htdig is well-behaved, and will stay strictly within the bounds of the URLs you specify:
start_url: http://websiteone.com http://websitetwo.com
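Before indexing, it can be reassuring to see exactly which sites the file lists. The sketch below prints the start_url value one URL per line. It assumes the attribute sits on a single line (no backslash continuation), and conf_value is a hypothetical helper name, not part of ht://dig.

```shell
#!/bin/sh
# Config path as given in the article; override with CONF=... if yours differs.
CONF="${CONF:-/etc/htdig/htdig.conf}"

# Print the value of a "name: value" attribute from a config file.
# Assumption: the value is on one line (no backslash continuation).
# If the attribute appears more than once, the last one is printed.
conf_value() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" "$2" | tail -n 1
}

if [ -r "$CONF" ]; then
    # One URL per line makes a long space-delimited list easier to review.
    conf_value start_url "$CONF" | tr -s ' ' '\n'
else
    echo "cannot read $CONF" >&2
fi
```

The same helper can be pointed at any other single-line attribute in the file.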
Make sure that database_dir points to the directory you want to store your ht://dig database in. That's enough to get started. Now fire it up and index your selected sites:
# rundig -vvv > htdig.log
Turning on maximum verbosity and storing the output in a file lets you check that ht://dig found and indexed everything you wanted it to. After its run is finished, which can take a few minutes, test it out by opening the search page in a Web browser:
This should bring up the beautiful light blue ht://dig search page, with dropdown menus and Boolean searches and everything.
Customizing The Search Page
You'll probably want to customize the search page to match the rest of your site. Track down the search.html page and edit it just like any HTML page. And also header.html, long.html, short.html, syntax.html, footer.html, nomatch.html, and wrapper.html.
Fine-Tuning Indexes
You've probably noticed that Google ignores common words like "to," "a," "the," and so forth. You can do the same, to keep htdig fast and useful. In /htdig.conf add the bad_word_list directive. Then make a list of words that you don't want indexed in a text file, one word per line. There should be a sample bad_words file to look at. Then name the file:
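As a rough sketch of both steps, the shell snippet below writes a small word list and prints a directive line that would point at it. The path ./bad_words and the sample words are placeholders chosen for illustration, not ht://dig defaults, and make_bad_words is a hypothetical helper. The sample bad_words file shipped with your package is a better starting point for a real list.

```shell
#!/bin/sh
# Placeholder location for the word list; adjust to your installation.
BAD_WORDS_FILE="${BAD_WORDS_FILE:-./bad_words}"

# Emit the given words one per line, lowercased, sorted, without duplicates.
make_bad_words() {
    printf '%s\n' "$@" | tr '[:upper:]' '[:lower:]' | sort -u
}

# Sample words only; extend the list to suit your content.
make_bad_words to a the and of The > "$BAD_WORDS_FILE"

# Print the line to add to htdig.conf (use an absolute path in practice).
printf 'bad_word_list: %s\n' "$BAD_WORDS_FILE"
```

Lowercasing and de-duplicating keeps the file tidy if you merge lists from several people.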
The exclude_urls: directive tells htdig not to index the specified URLs. It is a good idea not to index your CGI directory, temp files, robots.txt, .htaccess, Apache binaries, or anything else that is not meant to be shared.
You can also exclude certain file extensions, and a number of these should already be excluded by default with the bad_extensions: directive. htdig cannot parse binary files, so image files, binary executables, compressed archives, and sound files should be excluded.
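To get a feel for how these two directives interact, here is a small shell approximation that previews which URLs would be skipped. It assumes exclude_urls patterns match anywhere in a URL and bad_extensions match at its end. The pattern lists are illustrative values rather than your installation's defaults, would_skip is a hypothetical helper, and htdig's own matching may differ in details such as case handling.

```shell
#!/bin/sh
# Illustrative values, written as they would follow the attribute names.
EXCLUDE_URLS="/cgi-bin/ .cgi /tmp/"
BAD_EXTENSIONS=".gif .jpg .zip .tar .gz .wav"

# Return 0 if the URL would be skipped under the lists above, 1 otherwise.
would_skip() {
    url=$1
    # exclude_urls: reject when any pattern appears anywhere in the URL.
    for pat in $EXCLUDE_URLS; do
        case $url in *"$pat"*) return 0 ;; esac
    done
    # bad_extensions: reject when the URL ends with a listed extension.
    for ext in $BAD_EXTENSIONS; do
        case $url in *"$ext") return 0 ;; esac
    done
    return 1
}

for u in http://websiteone.com/index.html \
         http://websiteone.com/cgi-bin/form \
         http://websiteone.com/logo.gif; do
    if would_skip "$u"; then echo "skip:  $u"; else echo "index: $u"; fi
done
```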
Converting Files To HTML
Some non-ASCII text files can be converted to HTML by htdig, with a little help. For example, uncomment these lines in /htdig.conf to enable converting .pdf files:
external_parsers: application/postscript /usr/share/htdig/parse_doc.pl \
You'll need either XPDF or Adobe's Acrobat Reader installed. XPDF usually does a better job of translating .pdfs and .ps files to text. You can also convert MS Word docs, PowerPoint files, Excel files, and extract links from Shockwave Flash files. To do this, you need external file converters, a corresponding entry under external_parsers:, and perhaps a tweak to the doc2html.pl script that comes with ht://dig. The doc2html.pl script works as-is with these conversion utilities:
- catdoc -- extract text from Word documents
- rtf2html -- convert RTF documents to HTML
- pdftotext -- extract text from Adobe PDFs. Comes with XPDF
- ps2ascii -- extract text from PostScript
- pptHtml -- convert PowerPoint files to HTML
- xlHtml -- convert Excel spreadsheets to HTML
- swfparse -- extract links from Shockwave Flash files
These are all standard Linux utilities that you can find in the usual haunts. You can add as many more as you like, provided you edit the doc2html.pl script to call the new utilities. See the doc2html/README for more information.
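Before wiring converters into external_parsers:, it may help to check which of them are actually on your PATH. The sketch below uses the names exactly as listed above; the installed binaries may be spelled differently on your system (all lowercase, for instance), so treat the list as an assumption to adjust. check_tools is a hypothetical helper name.

```shell
#!/bin/sh
# Report, one line per tool, whether each named command is on the PATH.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
        fi
    done
}

# Names as given in the list above; adjust spelling to match your packages.
check_tools catdoc rtf2html pdftotext ps2ascii pptHtml xlHtml swfparse
```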
You'll probably want to create a cron job for updating the database periodically. rundig can suck up a lot of system resources, so schedule it for slack times. /etc/crontab is quick and easy, like this:
0 0 * * 0 root /usr/bin/rundig
This starts rundig at midnight every Sunday.
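If rundig still competes with other work at that hour, one option is to have cron call a small wrapper that lowers its priority and keeps a log. This is only a sketch: the log path is a placeholder, run_index is a hypothetical function name, and the rundig location is taken from the crontab line above.

```shell
#!/bin/sh
# rundig location from the crontab entry; the log path is a placeholder.
RUNDIG="${RUNDIG:-/usr/bin/rundig}"
LOG="${LOG:-/var/log/htdig-rundig.log}"

# Run the indexer at the lowest CPU priority, appending all output to the log.
run_index() {
    date >> "$LOG"
    nice -n 19 "$RUNDIG" >> "$LOG" 2>&1
}

if [ -x "$RUNDIG" ]; then
    run_index
else
    echo "rundig not found at $RUNDIG" >&2
fi
```

The crontab entry would then name the wrapper script in place of /usr/bin/rundig.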