Build Your Own Search Engine
(Wiby Install Guide)
Overview
Installation
Controlling
Scaling
Overview
Wiby is a search engine for the World Wide Web. The source code is now free as of July 8, 2022 under the GPLv2 license. I have been longing for this day! You can watch a quick demo here.
It includes a web interface allowing guardians to control where, how far, and how often it crawls websites and follows hyperlinks. The search index is stored inside of an InnoDB full-text index.
Fast queries are maintained by concurrently searching different sections of the index across multiple replication servers or across duplicate server connections, returning a list of top results from each connection, then searching the combined list to ensure correct ordering. Replicas that fail are automatically excluded; new replicas are easy to include. As new pages are crawled, they are stored randomly across the index, ensuring each search section can obtain relevant results.
The search engine is not meant to index the entire web and then sort it with a ranking algorithm. It prefers to seed its index through human submissions made by guests, or by the guardian(s) of the search engine.
The software is designed for anyone with some extra computers (even a Pi) to host their own search engine catering to whatever niche matters to them. The search engine includes a simple API for meta search engines to harness.
I hope this will enable anyone with a love of computers to cheaply build and maintain a search engine of their own. I hope it can cultivate free and independent search engines, ensuring accessibility of ideas and information across the World Wide Web.
       Web Traffic
            |
            |
+-----------+-----------+
| Reverse Proxy (nginx) |
+-----------+-----------+
            |
            |
+-----------+-----------+
| Wiby Core Server(s)   |+-----------------+----------------------------+
|(Golang or PHP version)|                  |                            |
+-----------+-----------+       +----------+----------+      +---------+---------+
            |                   |Replication Databases|+-----+|Replication Tracker|
            |                   +----------+----------+      +-------------------+
+-----------+-----------+                  |
|   Primary Database    |+-----------------+
|  (MySQL or MariaDB)   |
+----+-------------+----+
     |             |
     |             |
+----+-----+  +----+----+
|   Web    |  | Refresh |
|Crawler(s)|  |Scheduler|
+----------+  +---------+
Installation
I can only provide manual install instructions at this time.
Note that while the software is functionally complete, it is still in beta. Anticipate that some bugs will be discovered now that the source is released. Ensure that you isolate the search engine from your other important services, and if you are running parts of it out of your home, keep the servers on a separate VLAN. Make sure this VLAN cannot access your router or switch interface. Continue this practice even when the software reaches "1.0".
If you have created a "LAMP", or rather a "LEMP" server before, this isn't much more complicated. If you've never done that, I suggest you find a "LEMP" tutorial.
Build a LEMP server
Digital Ocean tutorials are usually pretty good so here is a link to one for Ubuntu 20 and Ubuntu 22.
For the sake of simplicity, assume all instructions are for Ubuntu 20 or 22. If you are on a different distro, modify the install steps to suit it.
If you don't have a physical server, you can rent computing space by looking for a "VPS provider". This virtual computer will be your reverse proxy, and if you want, it can host everything else too.
Install the following additional packages:
apt install build-essential php-gd libcurl4-openssl-dev libmysqlclient-dev golang git
Get Wiby Source Files
Download the source directly from Wiby here, or from GitHub. The source is released under the GPLv2 license. Copy the source files for Wiby to your server.
Compile the crawler (cr), refresh scheduler (rs), replication tracker (rt):
gcc cr.c -o cr -lmysqlclient -lcurl -std=c99 -O3
gcc rs.c -o rs -lmysqlclient -std=c99 -O3
gcc rt.c -o rt -lmysqlclient -std=c99 -O3

If you get any compile errors, it is likely due to the path of the mysql or libcurl header files. This could happen if you are not using Ubuntu. You might have to locate the correct path for curl.h, easy.h, and mysql.h, then edit the #include paths in the source files.
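If the compiler cannot find those headers, a quick way to locate them (a generic sketch; exact paths vary by distro) is:

# locate the header files the compiler is complaining about
find /usr/include -name mysql.h
find /usr/include -name curl.h
# then adjust the #include paths in cr.c, rs.c, and rt.c to match,
# or pass the include directory to gcc with -I/path/to/headers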
Build the core server application:
The core application is located inside the go folder. Run the following commands after copying the files over to your preferred location:

For Ubuntu 20:
go get -u github.com/go-sql-driver/mysql

For Ubuntu 22 OR the latest Golang versions:
go install github.com/go-sql-driver/mysql@latest
go mod init mysql
go get github.com/go-sql-driver/mysql

go build core.go
go build 1core.go

If you are just starting out, you can use '1core'. If you are going to set up replication servers or you are using a computer with a lot of available cores, you can use 'core', but make sure to read the scaling section.
If you want to use 1core on a server separate from your reverse proxy server, modify line 37 of 1core.go: replace 'localhost' with '0.0.0.0' so that it accepts connections over your VPN from your reverse proxy.
You can also use index.php in the root of the www directory and not use the Go version at all, though the PHP version is mainly used for prototyping.
Build the Primary Database:
Make sure these lines are inside of /etc/mysql/my.cnf, then restart mysql:

[client]
default-character-set=utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
max_connections = 2000
ft_min_word_len=1
sql_mode = "NO_BACKSLASH_ESCAPES"
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_520_ci
innodb_ft_enable_stopword=0
skip-character-set-client-handshake
default-authentication-plugin=mysql_native_password
wait_timeout = 800

#memory use settings, you should adjust this based on your hardware
innodb_buffer_pool_size = 1342177280
innodb_buffer_pool_instances = 2
innodb_flush_method = O_DIRECT

Login to MySQL and type:

create database wiby;
create database wibytemp;

Import the wiby and wibytemp database files:

mysql -u root -p wiby < wiby.sql
mysql -u root -p wibytemp < wibytemp.sql

Login to MySQL, create the following accounts and give them the correct access:
create user 'guest'@'localhost' identified by 'qwer';
create user 'approver'@'localhost' identified by 'foobar';
create user 'crawler'@'localhost' identified by 'seekout';
create user 'remote_guest'@'localhost' identified by 'd0gemuchw0w';

use wiby;
grant select on accounts to 'approver'@'localhost';
grant select on reviewqueue to 'approver'@'localhost';
grant insert on indexqueue to 'approver'@'localhost';
grant delete on reviewqueue to 'approver'@'localhost';
grant update on reviewqueue to 'approver'@'localhost';
grant select on indexqueue to 'crawler'@'localhost';
grant insert on windex to 'crawler'@'localhost';
grant insert on indexqueue to 'crawler'@'localhost';
grant update on windex to 'crawler'@'localhost';
grant delete on indexqueue to 'crawler'@'localhost';
grant delete on windex to 'crawler'@'localhost';
grant select on windex to 'crawler'@'localhost';
grant insert on reviewqueue to 'crawler'@'localhost';
grant select on windex to 'guest'@'localhost';
grant insert on reviewqueue to 'guest'@'localhost';
grant insert on feedback to 'guest'@'localhost';
grant select on feedback to 'approver'@'localhost';
grant delete on feedback to 'approver'@'localhost';
grant insert on graveyard to 'approver'@'localhost';
grant update on graveyard to 'approver'@'localhost';
grant delete on graveyard to 'approver'@'localhost';
grant select on graveyard to 'approver'@'localhost';
grant update on accounts to 'approver'@'localhost';
grant insert on accounts to 'approver'@'localhost';
grant delete on accounts to 'approver'@'localhost';
grant select on ws0 to 'crawler'@'localhost';
grant update on ws0 to 'crawler'@'localhost';
grant insert on ws0 to 'crawler'@'localhost';
grant delete on ws0 to 'crawler'@'localhost';
grant select on ws1 to 'crawler'@'localhost';
grant update on ws1 to 'crawler'@'localhost';
grant insert on ws1 to 'crawler'@'localhost';
grant delete on ws1 to 'crawler'@'localhost';
grant select on ws2 to 'crawler'@'localhost';
grant update on ws2 to 'crawler'@'localhost';
grant insert on ws2 to 'crawler'@'localhost';
grant delete on ws2 to 'crawler'@'localhost';
grant select on ws3 to 'crawler'@'localhost';
grant update on ws3 to 'crawler'@'localhost';
grant insert on ws3 to 'crawler'@'localhost';
grant delete on ws3 to 'crawler'@'localhost';
grant select on windex to 'remote_guest'@'localhost';
grant select on ws0 to 'remote_guest'@'localhost';
grant select on ws1 to 'remote_guest'@'localhost';
grant select on ws2 to 'remote_guest'@'localhost';
grant select on ws3 to 'remote_guest'@'localhost';

use wibytemp;
grant select on titlecheck to 'crawler'@'localhost';
grant insert on titlecheck to 'crawler'@'localhost';
grant delete on titlecheck to 'crawler'@'localhost';
grant select on rejected to 'approver'@'localhost';
grant insert on rejected to 'approver'@'localhost';
grant delete on rejected to 'approver'@'localhost';
grant insert on rejected to 'crawler'@'localhost';
grant select on reserve_id to 'crawler'@'localhost';
grant insert on reserve_id to 'crawler'@'localhost';
grant delete on reserve_id to 'crawler'@'localhost';
grant select on crawled to 'crawler'@'localhost';
grant insert on crawled to 'crawler'@'localhost';
grant delete on crawled to 'crawler'@'localhost';
FLUSH PRIVILEGES;
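As an optional sanity check that the my.cnf settings above took effect, you can query the relevant variables from the shell:

# should report ft_min_word_len = 1 and innodb_ft_enable_stopword = OFF
mysql -u root -p -e "SHOW VARIABLES LIKE 'ft_min_word_len'; SHOW VARIABLES LIKE 'innodb_ft_enable_stopword';"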
Copy the HTML files and PHP scripts to your web server

Copy the contents of the html directory into the nginx html directory (/var/www/html).

Configure nginx for Wiby
In /etc/nginx/, create a directory called 'phpcache', and another one called 'cache'.
Instead of going through every detail, I will provide a template, found inside etc/nginx/sites-available/ of the source code, for you to try out as your default nginx config. Newer nginx versions have deprecated /sites-available, so you might have to place the template inside /etc/nginx/conf.d instead.
You should learn nginx configuration on your own; this template is just to assist. If you are using only the PHP version, comment out all "core app" location entries to revert Wiby search to the PHP-only version.
Make sure ssl_certificate and ssl_certificate_key have the path for your SSL files instead of the example paths. If you don't want to use SSL, just remove the server {} configuration for SSL connections (on port 443). Also the example file references php7.4-fpm.sock, so if you are using a different version remember to update that as well (such as php8.1-fpm.sock on Ubuntu 22).
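After placing the template and setting those paths, it is worth validating and reloading nginx before moving on (standard nginx commands, assuming systemd):

nginx -t                  # check the configuration for syntax errors
systemctl reload nginx    # apply the new configuration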
Start the Refresh Scheduler
This program (rs) will make sure all pages indexed are refreshed at least once per week (or sooner, depending on how you assign updates to an individual website). You may want to run this on startup; the easiest way to set that up is with a cron job (crontab -e). Run './rs -h' to get more parameters and the info needed to run multiple crawlers. To start manually: 'nohup ./rs &' then press ctrl-c.
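A minimal crontab entry to start the refresh scheduler on boot could look like this (the /home/wiby path is an assumed example; use wherever you placed rs):

# crontab -e
@reboot cd /home/wiby && nohup ./rs > rs.log 2>&1 &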
Start the Crawler
It is best to run the crawler in a Screen session so that you can monitor its output. You can have more than one crawler running as long as you keep them in separate directories, include symlinks to the same robots folder and 'shards' file, and also set the correct parameters on each. To view the parameters, type './cr -h'. Without any parameters set, you can only run one crawler (which might be all you need anyway). If necessary, you can change the database connection from 'localhost' to a different IP from inside cr.c, then rebuild.
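As a sketch, a second crawler directory could be prepared like this (the paths are assumptions; run './cr -h' to see the parameters each crawler actually needs):

# assumed layout: the first crawler lives in /home/wiby/crawler1
mkdir /home/wiby/crawler2 && cd /home/wiby/crawler2
cp /home/wiby/crawler1/cr .
ln -s /home/wiby/crawler1/robots robots   # share the same robots folder
ln -s /home/wiby/crawler1/shards shards   # share the same 'shards' file
screen -S crawler2                        # then run ./cr inside with the correct parameters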
If using more than one crawler, update the variable '$num_crawlers' from inside of review.php and graveyard.php (line 73) to the number of crawlers you are using.
Note that you may need to change the crawler's user-agent (CURLOPT_USERAGENT in cr.c and checkrobots.h) if you have issues indexing some websites. Pages that fail to index are noted inside of abandoned.txt.
Make sure the robots folder exists, or create one in the same directory as the crawler. All robots.txt files are stored in the robots folder. They are downloaded once and then referenced from that folder on future updates. Clear this folder every few weeks to ensure robots.txt files get refreshed from time to time. You can also create custom robots.txt files for specific domains and store them there for the crawler to reference. To disable checking for robots.txt files, comment out the line calling the "checkrobots" function inside of cr.c.
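If you want that periodic cleanup automated, a cron entry along these lines works (the path and schedule are just examples; skip this if you keep custom robots.txt files there that you don't want deleted):

# crontab -e: empty the robots cache at 03:00 on the 1st of every month
0 3 1 * * rm -f /home/wiby/crawler1/robots/*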
If crawling through hyperlinks on a page, the following file types are accepted: html, htm, xhtml, shtml, txt, php, asp. Links containing parameters are ignored. These limitations do not apply to pages directly submitted by people.
Start the Replication Tracker
The tracker (rt) should run in the same directory that you will run the core server on. You do not need this if running 1core or the PHP only version. You can use a cron job to run it on startup, or start it manually with this command: 'nohup ./rt &' then press ctrl-c.
Start the Core Server
You can run the core server on startup with a cron job, or start it manually with this command: 'nohup ./core &' then press ctrl-c.
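If you start these with cron, remember that rt and core must launch from the same directory. A sketch, assuming a /home/wiby/core directory:

# crontab -e
@reboot cd /home/wiby/core && nohup ./rt > rt.log 2>&1 &
# short delay so rt has produced its output before core reads it (precautionary)
@reboot cd /home/wiby/core && sleep 5 && nohup ./core > core.log 2>&1 &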
If you are just starting out, '1core' or the php version is easiest to start with. Use 'core' if you want to scale computer resources as the index grows or if you have at least four available CPU cores. It is recommended you use 'core' as it makes better use of your CPU, but make sure to read the scaling section.
If you want to use 1core on a server separate from your reverse proxy server, modify line 37 of 1core.go: replace 'localhost' with '0.0.0.0' so that it accepts connections over your VPN from your reverse proxy.
Set Administrator Password for the Web Interface
There is no default web login; you will have to set this manually the first time:

Rename the /html/hash folder to something private.
Edit html/private_folder_name/hashmake.php and change 'secretpassword' to your preferred admin password.
Access /private_folder_name/hashmake.php from your browser and copy down the hash.
After you have copied it down, delete or remove hashmake.php from your web server folder so that the hash cannot be discovered.

Login to MySQL and create the account:

use wiby;
INSERT INTO accounts (name,hash,level) VALUES('your_username','your_password_hash','admin');

You can now access /accounts/ from your browser and log in to create and manage all accounts for administrators and guardians of the search engine.
admin - Can access all web forms for the search engine and use the /accounts/ page to create and delete accounts.
guardian - The main role of a guardian is to gatekeep the index of the search engine. Can access all forms except for /readf/, and can only use the /accounts/ page to change their own password.
Controlling the Search Engine
There are several forms to control the search engine. There is no central form linking everything together, just a collection of different folders that you can rename if you want.
/submit/
This public facing form allows users of the search engine to submit websites for indexing, provided they comply with your submission criteria, which you can modify on /submit/form.html.php.
/accounts/
This is the account management page. Admins have options to create, lock, change account type, delete, and reset passwords. Guardians have the option to change their password.
/review/
This is the most important form, intended for you to verify website submissions meet your criteria. Up to 10 pages are assigned to each guardian or admin that accesses the form. The pages will remain assigned to that account for up to 30 minutes. From here you can control how much, how deep, and how often the web crawler will access each submission. Here is an example of the available options for a website submission:
url_that_was_submitted
[Worksafe] [Surprise] [Skip] [Bury] [Deny] [Updatable]
[Crawl: Depth Pages Type Enforce Rules Repeat]
Explanation of the above options:
Worksafe - Indicates if the website is safe for work. Set by the user who submitted the website, however you can change it based on your determination.
Surprise - Checking this box will put it in the "surprise me" feature, where users get redirected to random websites when they click "surprise me". Note that this feature won't show NSFW websites even if they are set to surprise.
Skip - Selecting this option will skip indexing the page and it will reappear on the review form after you submit the rest of the pages for crawling.
Bury - Selecting this will move the page to a graveyard (/grave/), a holding place with the same options as /review/ for websites that might have stopped working but that you suspect may come back online. The crawler detects dead pages automatically and sends them back into review: when you click on a link there and see a 404, you can be assured the crawler sent it back after two failed update cycles. This also happens if the title of the page changes. The crawler only does this for pages directly submitted by people; this courtesy is not given to websites that were automatically crawled but then fail to work later on. For those sites, after two failed update cycles, the page is simply removed.
Deny - Select this to drop the page from being indexed. If the page does not meet your submission criteria, this would be the option to remove it from the queue.
Updatable - The update cycle for the web crawler to return to the page. This only applies to pages submitted by people; pages found by link crawling always go on a 1-week update cycle.
------------------- Crawl -------------------
The options listed below control how the crawler indexes hyperlinks on the website. By default, the crawler does not index any hyperlinks, it will only index the page that is submitted.
Depth - How many layers of links to crawl through. You must set at least a depth of 1 if you want to crawl any hyperlinks. Setting a negative value = no limit. Be careful about that.
Pages - How many pages to crawl on each link layer (depth). They will be randomly selected. You must set at least 1 if you want to crawl any hyperlinks. Setting a negative value = no limit. Be careful about that.
Type - Indicates if you want to only crawl links local to the website, or links external to the website, or both.
Enforce rules - This is a blunt tool that checks if pages have more than two scripts and/or css files. If the limit is exceeded, the page will not be indexed. I don't use it and prefer to manually check based on more forgiving criteria.
Repeat - While the crawler will always return to update each page in the index, it won't crawl through hyperlinks again unless you tell it to. Even then, it only crawls hyperlinks on the page at a depth of 1 when repeat is selected.
/ban/
Delete or ban a list of URLs from the index with this form. You can't delete an entire domain with it, for that you can build your own query in the MySQL console.
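A hedged sketch of such a query, assuming the page URLs are stored in a column named url in the windex table (check the schema in wiby.sql before running anything):

# delete every indexed page under a domain (column name 'url' is an assumption)
mysql -u root -p wiby -e "DELETE FROM windex WHERE url LIKE 'http://example.com/%' OR url LIKE 'https://example.com/%';"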
/bulksubmit/
Admins/Guardians can import a list of URLs into the review queue with this form.
/feedback/
Users can submit feedback for you with this form.
/readf/
Where admin accounts can read feedback submitted by users.
/grave/
It has the same features as /review/. Websites that you don't yet want to index but don't want to forget about are stored in /grave/ by selecting 'bury' from inside /review/. For pages submitted directly by people, the web crawler will move 404'd pages, or pages whose title has changed, back to /review/ after two update cycles in which the page does not return to normal. So after a few weeks you may notice dead pages appearing in /review/; you can decide to drop the page or bury it, which moves it to /grave/. The page might come back to normal at some point, and you can check /grave/ to see if it resurrects.
/insert/
This was the first form created back in late 2016 to populate the Wiby index and see if the search engine could even work as a proof of concept. It was meant to manually enter pages into the index as no crawler existed yet. It is still useful if you want to manually index a page that refuses to permit the crawler to access it. In that case, set updatable to 0.
/tags/
If you want to force a website to appear at the top rank for specific single-word queries (like "weather"), you can do so by tagging the words to the target URL.
/json/
This is the JSON API developers can use to connect their services to the search engine. Instructions are located at that location.
Additional Notes
If you need to stop the web crawler in a situation where it was accidentally queued to index an unlimited number of pages, first stop the crawler program, truncate the indexqueue table ('truncate indexqueue;'), then restart the crawler.
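For example, from the shell:

# 1. stop the crawler (ctrl-c in its screen session)
# 2. empty the queue of pending crawl jobs
mysql -u root -p wiby -e "TRUNCATE indexqueue;"
# 3. restart the crawler as before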
Scaling the Search Engine
You can help ensure sub-second search queries as your index grows by building MySQL replica servers on a local network close to each other. Run the core application AND the replication tracker (rt) in the same directory on one or more full-replica servers, and point your reverse proxy to use them. Edit the servers.csv file for rt to indicate all available replica IPs and the shard tables (ws0 to wsX) available on each. Four are already preconfigured.
If you have a machine with at least four CPU cores, entering multiple duplicate entries for the same server inside servers.csv (e.g. one for each CPU core) also works. By default, four duplicate connections are already set up to use your existing machine.
The core application checks the replication tracker (rt) output to determine if any replicas or duplicate connections are available; it will initiate a connection on each one and task it with searching a different shard table, drastically speeding up searches.
The search results per page limit is 12, and the total number of replicas/shards defined in servers.csv should either divide evenly into 12 or be a whole multiple of it. You don't need to restart the tracker when editing servers.csv. As an example, if you have three computers with a 4-core CPU in each, you can create up to 12 shard tables, then point the tracker to use 4 shards on each computer for maximum use of those cores. Another option would be to keep the default configuration of four shards and four duplicate connections, host the core application and rt on each computer, and use nginx to load balance traffic between them.
The reverse proxy and replica servers can be connected through a VPN such as WireGuard or OpenVPN; however, the IPs in servers.csv should be the local IPs for the LAN that the replicas are all connected on. See the instructions to set up a MySQL replica, and here is a longer tutorial on MySQL replicas should you need more info.
Indicate the number of shards in the 'shards' file that the crawler references (four are already preconfigured). If for some reason you need to rebuild/rebalance the shard tables, see the directions here. To create more shard tables, see this section. If for some reason you only want to host specific shard tables on a replica, you can use replication filtering.
Instructions for Building a MySQL Replica:
On the primary server, add these lines to my.cnf under [mysqld], but only once you have a VPN to reach your replicas. Replace my.vpn.ip with your own, then restart MySQL.

#setting up replication below
bind-address = 127.0.0.1,my.vpn.ip
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = wiby
binlog_format = mixed

In MySQL on the primary server, create a user for replica access, replacing the IP 10.0.0.% with your own VPN IP and allowed subnet:

create user 'slave_user'@'10.0.0.%' identified by 'd0gemuchw0w';
GRANT REPLICATION SLAVE ON *.* TO 'slave_user'@'%';
FLUSH PRIVILEGES;

On the replica server, after installing MySQL, ensure the following my.cnf configuration, setting server-id to a unique id for each replica, then restart MySQL:

[client]
default-character-set=utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
max_connections = 2000
ft_min_word_len=1
sql_mode = "NO_BACKSLASH_ESCAPES"
#character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_520_ci
innodb_ft_enable_stopword=0
skip-character-set-client-handshake
default-authentication-plugin=mysql_native_password
wait_timeout = 800

#memory use settings, you should adjust this based on your hardware
innodb_buffer_pool_size = 1342177280
innodb_buffer_pool_instances = 2
innodb_flush_method = O_DIRECT

#setting up replication below
bind-address = 0.0.0.0
server-id = 2
relay_log_info_repository = TABLE
relay_log_recovery = ON
sync_binlog=1

Make sure only VPN and VLAN addresses can reach your replicas. The bind address of 0.0.0.0 can be replaced with '127.0.0.1,replica.vpn.ip', which is safer but also more crash prone if the VPN address is not available on startup.
To export the database to the replica server, on the primary server, stop the web crawler and hide any web forms that can accept new data, then open MySQL and do the following.

USE wiby;
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000055 | 15871269 | wiby         |                  |                   |
+------------------+----------+--------------+------------------+-------------------+

Keep the above session window open (or run it in a screen session).
Copy down the information from that table. In a separate session window, export the database:

mysqldump -u root -p wiby > wiby.sql

Once you have exported the database and recorded what you need, you can unlock the tables and resume as normal. On the session window displaying the master status:

UNLOCK TABLES;

You can now close that window if you want.
On the replica server, login to MySQL and create the database:

CREATE DATABASE wiby;
EXIT;

Import the database:

mysql -u root -p wiby < wiby.sql

Login to MySQL and type the following, replacing primary_server_ip, MASTER_LOG_FILE, and MASTER_LOG_POS with the values from the table:

CHANGE MASTER TO MASTER_HOST='primary_server_ip',MASTER_USER='slave_user', MASTER_PASSWORD='d0gemuchw0w', MASTER_LOG_FILE='mysql-bin.000055', MASTER_LOG_POS=15871269;
START SLAVE;

To verify that the replica is synchronized, type the following on the replica in MySQL:

SHOW SLAVE STATUS\G

Make sure that:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

In MySQL on the replica, create the accounts required for the replication tracker and core application. Note that the remote_guest account will allow connections from any outside machine. Make sure your replica is protected behind a firewall.
use wiby;
create user 'guest'@'localhost' identified by 'qwer';
grant select on windex to 'guest'@'localhost';
create user 'remote_guest'@'%' identified by 'd0gemuchw0w';
grant select on windex to 'remote_guest'@'%';
grant select on ws0 to 'remote_guest'@'%';
grant select on ws1 to 'remote_guest'@'%';
grant select on ws2 to 'remote_guest'@'%';
grant select on ws3 to 'remote_guest'@'%';
create user 'crawler'@'localhost' identified by 'seekout';
FLUSH PRIVILEGES;

To update the host for any account, do the following:
use mysql;
Select user,host from user;
RENAME USER 'username'@'oldhost' TO 'username'@'newhost';
Creating More Shard Tables
There are four shard tables already in the database, but if you need more:
Stop the crawler and update the number in the 'shards' file, then copy a shard table entry (wsX) from the wiby.db template file, renaming it in the proper number sequence, and paste that into the mysql console on the primary database.
Make sure to give access to the new shard tables.
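For example, for a hypothetical fifth shard table named ws4, mirroring the grants used for ws0 through ws3 on the primary server (the host parts will differ on a replica, e.g. 'remote_guest'@'%'):

mysql -u root -p wiby <<'SQL'
grant select, insert, update, delete on ws4 to 'crawler'@'localhost';
grant select on ws4 to 'remote_guest'@'localhost';
FLUSH PRIVILEGES;
SQL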
You will need to rebalance the shards; follow the steps below, then restart the crawler. Going forward, it will round-robin insert into those shards as new pages are crawled.
Accessing Additional Shards
Apply the account access permissions listed here for core app and rt access to each replica and here for crawler access to each new shard table on the primary server or replica hosting the core app.
Balancing Additional Shards
For now, you have to manually rebalance the shards when adding new ones. The most straightforward way to rebalance them is to:
Update 'servers.csv' with the additional shard connections being used.
Stop the crawler and update 'shards' with the new total of shards being used.
Start up rt, then copy down the id numbers referenced for each connection.
Truncate all the shard tables on the primary:

truncate ws0;
truncate ws1;
etc..

Repopulate the 1st shard table (and so on) on the primary server:

"UPDATE windex SET shard = 0 WHERE id BETWEEN 0 AND 5819;" replacing those id numbers with those indicated by rt.
"INSERT INTO ws0 SELECT * FROM windex WHERE id BETWEEN 0 AND 5819;" replacing those id numbers with those indicated by rt.

Repeat those steps for each shard table.

These changes will propagate down to the replicas, and the core application will be able to use them as long as permissions to those tables were added.
Load Balancing
You should run the core application on one or more of your replicas and have nginx send traffic to it; this way you can reduce the burden on your VPS. The replication tracker (rt) must run on the same server and in the same directory as the core application (not required for 1core).
Add the replica server's VPN address/port to upstream remote_core {} in the default nginx config (see the provided example template). You can use the VPS as a backup instead by adding 'backup' to its address (e.g. server 127.0.0.1:8080 backup;).
Additional Notes
The crawler stores a maximum of 80KB worth of text from the body of each webpage. To change this limit, edit the "body_len" definition from inside htmlparse.h and recompile the crawler. This will affect the total size of the index and overall search speeds.
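For instance, to find the definition before editing (the grep is generic; the gcc line is the same compile command used earlier):

grep -n "body_len" htmlparse.h                    # locate the definition to edit
gcc cr.c -o cr -lmysqlclient -lcurl -std=c99 -O3  # recompile the crawler afterwards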