Welcome to the BigAtticHouse
424 South Division
Chenoa, IL 61726
v. 815-945-7649
BigAtticHouse Vectorspace Database (VSDB)
provides high-speed, scalable search capabilities for any project.

Tutorial 1: Building a simple search engine with PHP and VSDB

Included with VSDB are several .php files that help you create PHP pages that work directly with VSDB. This tutorial describes the steps necessary to load and query VSDB, and how to parse the results.
Requirements: LAMP system (Linux/Apache/MySQL/PHP), VSDB 1.5

1. Connecting to VSDB from PHP

VSDB is a socket-based server that is accessed over TCP/IP. In the clients/php/ subfolder of your VSDB server you will find stemmer.php, buildvector.php and vsdbclient.php; include the ones you need in your PHP page before calling the functions used in this tutorial. Your page must connect to VSDB before it can update or query it. This tutorial assumes you already have a VSDBd instance running.
(Both are released under the GPL, so you may need an LGPL or other alternative stemmer class if your project requires closed-source distribution.)
    <?php
    //Connect
    $conn = ConnectToVSDB("localhost","10001");

    //Disconnect
    DisconnectVSDB($conn);
    ?>
 

2. Loading the Vectorspace

To do anything useful, our vectorspace must be populated. In this example, we open a sample database of notes and populate VSDB with its data. VSDB uses the equivalent of GUIDs to identify both vectors and dimensions. These IDs correspond directly to Windows GUIDs as well as to MD5 hashes of text data (both are 128-bit values), and this is the fundamental format in which VSDB communicates over the wire. At BigAtticHouse we tend to use GUIDs directly, but this tutorial uses MD5 hashes of integer keys to match the way many of our users' databases are keyed.
This process only needs to be run once, to initially populate the service. Caching the dataset to disk allows it to be reloaded the next time the service starts (for example, after a system restart). Because VSDB resides in memory, it must be flushed to disk periodically to save its state.
Although you can stay connected and update the vectors in a batch, VSDB performs better if you disconnect between updates. VSDB is select()-based, and batch updates may delay clients that are attempting to query. You may also want to put a sleep() call in the loop to leave more CPU time for your end users.

Let us suppose we have the following table defined in MySQL:
 CREATE TABLE NOTES (
     NOTEID int(6) NOT NULL auto_increment,
     NOTE text,
     primary key (NOTEID)
 );
We would build a simple query to read the table and populate VSDB:
<?php
//Connect to MySQL and select the notes database
$db = mysqli_connect("localhost","[user]","[password]","[our_notes_database]") or die(mysqli_connect_error());
//Query the database
$sql = "select noteid,note from NOTES";
$rowset = mysqli_query($db,$sql) or die(mysqli_error($db));
while ($row = mysqli_fetch_row($rowset)){
    //Connect to VSDB
    $conn = ConnectToVSDB("localhost","10001");
    //get data from the row
    $noteid = $row[0];
    $notes = $row[1];
    //create our vector array and a vectorid
    $vector = BuildVectorFromText($notes);
    //"salt" the md5 hash to generate a GUID.
    $vectorid = md5("NOTES_TABLE_$noteid");
    //send the vector to VSDB. we only UPDATE - regardless of the
    //record existing or not - VSDB will know what to do with it.
    UpdateVsdb($conn,$vectorid,$vector);
    //Disconnect from VSDB
    DisconnectVSDB($conn);
}
?>
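
Because the same vector ID has to be derived whenever a record is loaded, edited or deleted, it may help to keep the ID scheme in one place. The following is a minimal sketch only: the function name vsdb_vector_id and its parameters are hypothetical and are not part of the VSDB client files. It reproduces the md5("NOTES_TABLE_$noteid") scheme used above.

```php
<?php
// Hypothetical helper (not part of the VSDB client files): derive a
// vector ID from a prefix and an integer key, as in the loader above.
function vsdb_vector_id($prefix, $key)
{
    // md5() returns 32 hexadecimal characters, i.e. a 128-bit value,
    // the same size as a GUID.
    return md5($prefix . "_" . (int) $key);
}

// Same ID that the loader generates for the note with NOTEID 1.
$vectorid = vsdb_vector_id("NOTES_TABLE", 1);
echo $vectorid, "\n";
```

Using a distinct prefix per table keeps integer keys from different tables from colliding in the same vectorspace.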

3. Loading a single vector (editing)

When editing or deleting records, we can briefly connect, update and disconnect to keep our dataset in sync. It is also common to run a script similar to the one in step 2 as a cron job that periodically updates the entire vectorspace. Keeping each connection brief gives end users who are querying the system the most responsive experience. If your project has only a few users (20 or fewer), you can disregard this practice.
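
The brief connect, update and disconnect cycle for a single edited record might look like the sketch below. The function name update_note_vector is hypothetical; it assumes the client functions used in step 2 (ConnectToVSDB, BuildVectorFromText, UpdateVsdb and DisconnectVSDB) have already been included, and it reuses the NOTES_TABLE ID scheme from that step.

```php
<?php
// Hypothetical helper: re-send one note's vector after its row is edited.
// Assumes the VSDB client functions from step 2 are already included.
function update_note_vector($noteid, $notetext, $host = "localhost", $port = "10001")
{
    // Connect only for the duration of this one update.
    $conn = ConnectToVSDB($host, $port);
    // Rebuild the vector from the edited text.
    $vector = BuildVectorFromText($notetext);
    // Derive the same ID that the initial load used for this row.
    $vectorid = md5("NOTES_TABLE_" . $noteid);
    // UPDATE covers both new and existing vectors.
    UpdateVsdb($conn, $vectorid, $vector);
    DisconnectVSDB($conn);
    return $vectorid;
}
```

You would call this right after the corresponding MySQL update succeeds, for example update_note_vector(7, $newtext).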

4. Deleting a vector

To delete a vector, you merely pass the vectorid to VSDB.
<?php
//Connect to VSDB
$conn = ConnectToVSDB("localhost","10001");
//"salt" the md5 hash to generate a GUID.
$vectorid = md5("NOTES_TABLE_1");
//Delete the vector from VSDB
DeleteFromVSDB($conn,$vectorid);
//Disconnect from VSDB
DisconnectVSDB($conn);
?>

5. Caching the vectorspace

VSDB will cache a copy of the vectorspace to its operating folder (passed to VSDBd as a parameter when the service is started) as vsdb.dat and thesaurus.dat.
<?php
//Connect to VSDB
$conn = ConnectToVSDB("localhost","10001");
//Cache the VSDB dataset
CacheVSDB($conn);
//Disconnect from VSDB
DisconnectVSDB($conn);
?>

6. Querying the vectorspace

A query is just another vector, and is built the same way. You pass the query to the vectorspace and define a threshold that limits the results. The threshold is a single-precision floating-point number. Exact matches score 1, and unrelated vectors score 0 or less. Some experimentation may be required to tune the threshold to your vectorspace, depending on how the vectors are constructed. Extensive text (notes, etc.) will require a lower threshold, around 0.25-0.33, while more precise measurements (number of rooms in a house, latitude and longitude, years of experience) will likely require higher thresholds, near 0.75-0.90. A threshold of 1 will find only exact matches.
In the example, assume $querytext is user input from a text box.
<?php
//Connect to VSDB
$conn = ConnectToVSDB("localhost","10001");
//create our query vector
$query = BuildVectorFromText($querytext);
//VSDB requires a queryid; hash a unique value to get one.
$queryid = md5(uniqid("",true));
//Execute the query.
$result = QueryVsdb_Array($conn,$queryid,$query,0.33);
//Disconnect from VSDB
DisconnectVSDB($conn);
?>
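
To build some intuition for the threshold values discussed above, the sketch below computes cosine similarity between small sparse vectors. This is an assumption made for illustration: this tutorial does not state how VSDB computes its score, and cosine similarity is simply one common vector-space measure with the same behavior at the extremes (1 for identical vectors, 0 for vectors that share nothing). The function name and the dimension => weight array layout are hypothetical.

```php
<?php
// Illustration only: cosine similarity between two sparse vectors stored
// as dimension => weight arrays. VSDB's actual scoring is not documented
// in this tutorial; this is an assumed stand-in with a similar range.
function cosine_similarity(array $a, array $b)
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $dim => $w) {
        $normA += $w * $w;
        if (isset($b[$dim])) {
            $dot += $w * $b[$dim]; // only shared dimensions contribute
        }
    }
    foreach ($b as $w) {
        $normB += $w * $w;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty vector matches nothing
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

$query = array("attic" => 3, "house" => 4);
$same = cosine_similarity($query, $query);                  // identical vectors
$partial = cosine_similarity($query, array("house" => 4));  // one shared dimension
$none = cosine_similarity($query, array("garage" => 2));    // no shared dimension
printf("%.2f %.2f %.2f\n", $same, $partial, $none); // prints "1.00 0.80 0.00"
```

Under this assumed measure, a threshold of 0.33 would keep the first two vectors and drop the third.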

7. Handling the results

Query results are returned as a delimited text string. In each name/value pair, the name field is the vectorid and the value field is a floating-point number that is the "score" for that particular vector. VSDB 1.5 does not sort the results. Version 2.0 (currently in beta) *does* sort the results.
Once the results are retrieved, you can either insert them into a temporary table and join it back to your data tables, or pass each result to a function that displays the item. If you know you may have a large result set, inserting into a temporary table and sorting/paging the results with the MySQL LIMIT clause has proven to be very effective in production environments.
<?php
//Connect to VSDB
$conn = ConnectToVSDB("localhost","10001");
//create our query vector
$query = BuildVectorFromText($querytext);
//VSDB requires a queryid; hash a unique value to get one.
$queryid = md5(uniqid("",true));
//Execute the query.
$result = QueryVsdb($conn,$queryid,$query,0.33);
//Split the result into lines, then each line into vectorid and score.
$lines = explode("\n",$result);
$sz = sizeof($lines);
for ($i=0;$i<$sz;$i++){
    //Skip blank lines (for example, after a trailing newline).
    if (trim($lines[$i]) == "") continue;
    $line = explode(" ",$lines[$i]);
    $vectorid = $line[0];
    $score = $line[1];
    //Call a function that will display the item.
    Call_Some_Function_To_Display($vectorid,$score);
}
//Disconnect from VSDB
DisconnectVSDB($conn);
?>
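
Since VSDB 1.5 does not sort its results, a small result set can also be sorted and paged in PHP rather than in a temporary table. The sketch below is one possible approach: the function name parse_vsdb_result is hypothetical, the sample string is made-up data, and the layout (one result per line, vectorid and score separated by a space) is assumed from the way the example above splits the result.

```php
<?php
// Hypothetical helper: parse a VSDB result string into vectorid => score
// pairs, highest score first. Assumes the layout parsed in the example
// above: one result per line, vectorid and score separated by a space.
function parse_vsdb_result($result)
{
    $scores = array();
    foreach (explode("\n", $result) as $line) {
        $fields = explode(" ", trim($line));
        if (count($fields) < 2) {
            continue; // skip blank or malformed lines
        }
        $scores[$fields[0]] = (float) $fields[1];
    }
    arsort($scores); // sort by score, descending, keeping the vectorid keys
    return $scores;
}

// Made-up sample data standing in for a real QueryVsdb() result.
$sample = "aaa 0.41\nbbb 0.87\nccc 0.35\n";
$sorted = parse_vsdb_result($sample);
// First page of two results, similar to a MySQL "LIMIT 0,2".
$page = array_slice($sorted, 0, 2, true);
print_r($page);
```

For large result sets, the temporary-table approach described above is likely to remain the better choice, because the join and paging then happen inside MySQL.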

8. Additional Information

This tutorial is copyright 2007 BigAtticHouse and is provided for instructional use by our clients. Interfaces may change, and the cases presented might not be optimal for your operating environment; treat this document as a guideline for usage.

Document Version 1.0, 12/26/2007 - Michael Johnson