Friday, May 13, 2016

Apache Solr Install w/ MongoDB Indexing

Introduction

Solr is an awesome search platform.  Built on the trusted and beloved Lucene library, Solr offers everything Lucene offers and more, such as replication and sharding for horizontal scaling, faceted filtering of results, and multi-core separation.  SolrCloud is a Solr offering with sharding in place for real-time distributed reading and writing across a farm of servers, designed to handle extra storage and traffic.  Traditional Solr setups can also be distributed, but only in the form of replication, where multiple slave nodes periodically pull from the master index and all search queries are served by the slaves.  That is pull replication, whereas SolrCloud implements real-time push replication.


Solr vs ElasticSearch (ES)

Both are Java-based and both are built on top of Lucene, so what sets them apart?  For the most part they are very similar, with a few differences.  For the average business case needing a highly scalable search platform, either one will be fine.  These are some of the differences between the two platforms.

  1. ES is the newer platform, which gives it a more modern feel, and distributed operation is handled out of the box with minimal configuration and setup.  SolrCloud requires a separate coordination system such as Apache ZooKeeper to keep all the shards in sync, which means additional setup steps compared to ES, whose built-in Zen discovery handles clustering with very little effort.  With this in mind, you could say Solr setups are for the more serious developers, since more in-depth knowledge is required to get things up and running.  Conversely, ES setups are sometimes said to develop long-term problems precisely because the initial setup is undemanding: since "anyone can get it up and running", the people maintaining it may lack the knowledge to keep it healthy.
  2. Solr is the older product by a few years and the community reflects that: it is bigger and more established, with more resources available on the web to address issues you may encounter.  Solr also has better official documentation.
  3. ES has a more robust analytics suite, which proves useful for marketing purposes.


Steps To Install Solr 6.0.0 in Linux

  1. yum -y update
    Ensure everything is up to date first
  2. java -version
    Check java version
  3. yum list available java*
    Check all available java versions in the YUM package manager
  4. yum install java-1.8.0-openjdk.x86_64
    Install the latest version of java if not installed already
  5. cd /tmp
    Change the working directory to temp to prepare for download
  6. wget http://apache.org/dist/lucene/solr/6.0.0/solr-6.0.0.tgz
    Download the solr install
  7. tar -zxvf solr-6.0.0.tgz
    Uncompress file
  8. cd solr-6.0.0
  9. bin/install_solr_service.sh /tmp/solr-6.0.0.tgz
    Run the script to install solr

    The Solr service is now installed.  The installation dir is /opt/solr and the data dir is /var/solr.

    A solr user and group are created by the script; the solr user is used to create and own the Solr cores.
  10. View admin web UI here:

    http://localhost:8983/solr/

    or

    http://IPADDRESS:8983/solr/ (remotely)

    This may require updating the firewall for public access and opening the default port 8983 (see the sketch below).
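
On CentOS 7 with firewalld, opening the port might look like this minimal sketch (assumes firewalld is running; adjust for iptables or whichever firewall your distribution uses):

  # open the default Solr port to remote clients
  firewall-cmd --permanent --add-port=8983/tcp
  firewall-cmd --reload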

Create Sample Core and Load Documents

  1. sudo chown -R solr:solr /var/solr/
    Make the solr user the owner of the data directory
  2. cd /opt/solr
  3. su - solr
    Switch to the solr user
  4. bin/solr create -c documents
    Create the documents core; if permissions are correct you will not see errors
  5. bin/post -c documents docs
    Load the core with the sample HTML docs that ship with Solr (skip this outside of testing); you can verify the result with the query sketched below
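
To confirm the documents were indexed, you can run a quick query against the core; a minimal sketch using curl (the query simply matches everything):

  curl "http://localhost:8983/solr/documents/select?q=*:*&rows=5&wt=json"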

Operations

All configuration for a core resides in that core's folder on the file system.  In our case, the path is:

/var/solr/CORE_NAME/conf

Changes to the config files are only picked up after restarting the Solr service.  These are the commands:
  1. service solr start
  2. service solr stop
  3. service solr restart
 

Data Import Handler and MongoDB

 The example above is a simple scenario where static files on the same file system are pulled in and indexed.  In a real environment, you may need to index records from a relational database or even a NoSQL database such as MongoDB.  All of this is done with the DataImportHandler (DIH).  It supports indexing remote files and database records along with other sources, but there is no native support for indexing Mongo collections and documents, so at this time a custom solution is needed.  You will need to download the latest versions of the following JAR files:

  1. solr-dataimporthandler-x.x.x.jar 
  2. solr-mongo-importer-x.x.x.jar 
  3. mongo-java-driver-x.x.jar
 
These files have to be dropped into:

/opt/solr/dist

The core configuration files are at: /var/solr/CORE_NAME/conf
This folder contains the XML files that define the system behavior.  You have to make Solr aware of the JAR files you dropped in by updating the solrconfig.xml file.  Add the following alongside the other lib declarations:

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-mongo-importer-.*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="mongo-java-driver-.*\.jar" />


The schema file is managed-schema (in Solr 6 it has no .xml extension).  This file is where all field names and definitions are stored.  Add the custom fields for the indexes here.  They will look something like this:

<field name="firstName" type="string" indexed="true" stored="true"/>
<field name="email" type="string" indexed="true" stored="true"/>
etc... 
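
The data-config.xml shown later also produces fields from its transformers (the flattened address parts and a static collection marker) and from the jobs collection, so those need declarations as well.  A sketch, assuming plain string types and that the names are not already defined in the default schema:

<field name="lastName" type="string" indexed="true" stored="true"/>
<field name="address1" type="string" indexed="true" stored="true"/>
<field name="address2" type="string" indexed="true" stored="true"/>
<field name="city" type="string" indexed="true" stored="true"/>
<field name="state" type="string" indexed="true" stored="true"/>
<field name="zip" type="string" indexed="true" stored="true"/>
<field name="collection" type="string" indexed="true" stored="true"/>
<field name="location" type="string" indexed="true" stored="true"/>
<field name="title" type="string" indexed="true" stored="true"/>
<field name="description" type="string" indexed="true" stored="true"/>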

In the solrconfig.xml file you will need to indicate that a DataImportHandler will be used to handle external data imports. Please add:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

In the same conf directory, create a new data-config.xml file and put in:

<dataConfig>
<dataSource name="MongoSource" type="MongoDataSource" host="localhost" port="27017" database="dbname" username="userid" password="password"/>
<document name="import">
     <entity processor="MongoEntityProcessor"            
             datasource="MongoSource"
             transformer="MongoMapperTransformer"
             name="Users"
             collection="users"
             query="">

            <field column="_id"  name="id" mongoField="_id"/>  
            <field column="email"  name="email" mongoField="email"/>              
            <field column="firstName" name="firstName" mongoField="firstName"/> 
            <field column="lastName" name="lastName" mongoField="lastName"/>
           
            <entity name="Address"
              processor="MongoEntityProcessor"
              query="{'email':'${User.email}'}"
              collection="users"
              datasource="MongoSource"
              transformer="script:addressDataTransformer">
            </entity>
           
            <entity name="Collection"
              processor="MongoEntityProcessor"
              query="{'email':'${User.email}'}"
              collection="users"
              datasource="MongoSource"
              transformer="script:userCollectionDataTransformer">
            </entity>            
                   
       </entity>
      
       <entity processor="MongoEntityProcessor"            
             datasource="MongoSource"
             transformer="MongoMapperTransformer"
             name="Jobs"
             collection="jobs"
             query="">

            <field column="_id"  name="id" mongoField="_id"/>  
            <field column="location"  name="location" mongoField="location"/>    
            <field column="title"  name="title" mongoField="title"/>    
            <field column="description"  name="description" mongoField="description"/>                   

            <entity name="Collection"
              processor="MongoEntityProcessor"
              query=""
              collection="jobs"
              datasource="MongoSource"
              transformer="script:jobCollectionDataTransformer">
            </entity>
                   
       </entity>
 </document>

 <script><![CDATA[
function addressDataTransformer(row){
    var ret = row;
   
    if (row.get("address") !== null) {
        var address = row.get("address");
        var address1 = row.get("address");
        if (address.get("address1") !== null) {
            ret.put("address1", address.get("address1").toString());
        }
        if (address.get("address2") !== null) {
            ret.put("address2", address.get("address2").toString());
        }
        if (address.get("city") !== null) {
            ret.put("city", address.get("city").toString());
        }
        if (address.get("state") !== null) {
            ret.put("state", address.get("state").toString());
        }
        if (address.get("zip") !== null) {
            ret.put("zip", address.get("zip").toString());
        }
    }
    return ret;
}

function userCollectionDataTransformer(row){
    var ret = row;
   
    ret.put("collection", "users");
   
    return ret;
}

function jobCollectionDataTransformer(row){
    var ret = row;
   
    ret.put("collection", "jobs");
   
    return ret;
}
]]></script>
</dataConfig>



This example is a fairly complex one.  It shows flat one-to-one fields from a Mongo document mapped to index fields, as well as adding static field values and using custom data transformers to map nested Mongo document fields to index fields.


At this point the DataImportHandler should be set up.  If you load the web UI and the Data Import section does not display an error message, there are no errors in the configuration files.  The only thing left is to tweak the schema field definitions and configuration settings to tune system behavior and performance.


Automating Data Import

Now that the DataImportHandler is set up to import correctly, the final step is to schedule it to run periodically.  The simplest way to do this is to hit the URL that triggers a full or delta import, like this:

http://localhost:8983/solr/CORE_NAME/dataimport?command=full-import

In a Linux environment, you can create a cron job that curls this URL at a scheduled interval, as in the sketch below.  The Data Import section of the web UI will also show you the last time the index was updated, whether it was triggered via the UI or via a URL request.
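
For example, a crontab entry that performs a full import every night at 2 AM might look like this (the core name and schedule are placeholders):

  # crontab -e (as a user that can reach Solr)
  0 2 * * * curl -s "http://localhost:8983/solr/CORE_NAME/dataimport?command=full-import" > /dev/null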


Summary

Solr is a very powerful tool, and the advantage of using Solr with MongoDB is that you are separating the data store from the search engine.  The data store can focus on collecting data while the search engine focuses on queries and indexing.  You can also scale each component separately, which is essential for data and resource redundancy.


MongoDB Installation On A Linux Server

Introduction

The following instructions are for the latest 2.6 release of MongoDB (2.6.12) on the Linux server of choice (CentOS 7, 64-bit).  CentOS is essentially the free version of Red Hat and offers all the same features minus the corporate support.  Coming from a Windows development background, this entry is a little off the main course of my usual tips and tech posts, but it will be useful nonetheless since Mongo is traditionally hosted in a Linux environment.  Traditional Microsoft stack applications are starting to have LAMP pieces added to the mix.  Besides, in a production environment, all of this is license free even if the Microsoft pieces are not.


Steps

  1. sudo -i
    This gets you a root shell so the remaining commands run without permission restrictions.
  2. nano /etc/yum.repos.d/mongodb.repo
    This opens the yum repo file in the NANO editor, which has an easy-to-understand "interface".
  3. Paste in (for 64-bit servers, different for 32-bit)

    [mongodb]
    name=MongoDB Repository
    baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
    gpgcheck=0
    enabled=1

    This tells the YUM package manager where to download the MongoDB packages from.
  4. Save, exit nano
  5. yum install mongo-10gen mongo-10gen-server
    At this point the mongo service is installed and ready to use
  6. service mongod start
    This starts the service.

    service mongod stop
    This stops the service.

    service mongod restart

    You will need to restart the service if configuration items are changed.  The configuration file is here:

    /etc/mongod.conf
  7. To autostart:
    chkconfig mongod on

Management

The Mongo installation on the server comes with a simple web user interface, served by the mongod process itself on a separate port (no separate web server such as Apache is required).  The web UI can be accessed here from within the same machine:

http://localhost:28017

To access it remotely, you would need to perform some systems administration on the firewall to open the server up to the public along with both ports 27017 and 28017 (the default ports for direct connections and the web interface, respectively).  The web UI would then be accessed here:

http://IPADDRESS:28017

The web UI is nothing more than a status and health page.  It is a good indicator that Mongo is up and running and provides setup and status metrics, but nothing more.  To view collections and documents, I recommend a standalone app such as RoboMongo.  To connect directly to the database, you need to provide the host and port number at a minimum: if RoboMongo is installed locally then the host is localhost and the port is 27017, or use the server's IP address if connecting remotely (a mongo shell equivalent is sketched below).  The ports, as well as other configuration settings, can be changed via the configuration file.
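
For reference, the equivalent direct connection from the mongo shell would look something like this (IPADDRESS is a placeholder for your server's address):

  # local connection on the default port
  mongo --host localhost --port 27017
  # remote connection, assuming the firewall has been opened as described above
  mongo --host IPADDRESS --port 27017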


Comparison of Data Storage

There is a wealth of information on the Internet about adding databases and collections.  A collection can be compared to a table in a relational database.  An item in a collection is a document and can be compared to a data record in a relational database.  A main difference is that data records in a table all have the same fields, while documents in the same collection do not have to.  They are just JSON objects that can have different fields and structures.  Since they are JSON objects, they can also have nested fields, such as an address object within a user profile object.  In the relational database model, the address records would live in a separate table and be related to the user record via an address id number.  A big advantage of NoSQL over an RDBMS is that fewer queries and JOINs are required to obtain the same information: in the address example, only one "SELECT" query is needed in Mongo versus a SELECT plus a JOIN in MSSQL, as illustrated below.  A big drawback of nested objects in Mongo is that indexing them becomes more complex in search platforms such as Solr, since search engines traditionally expect flat objects to be indexed.  The resulting indexed documents can be multi-dimensional, but that comes from data transformation during the indexing process and not directly from the incoming objects.
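
As a rough illustration of the address example (the collection, table, and field names are just for illustration), the nested document comes back from a single Mongo query, while the relational model needs a join:

  // Mongo: one query returns the user together with its embedded address object
  db.users.find({ "email": "jane@example.com" })

  -- MSSQL: the same information requires a SELECT with a JOIN
  SELECT u.*, a.*
  FROM Users u
  JOIN Addresses a ON a.AddressId = u.AddressId
  WHERE u.Email = 'jane@example.com';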


Security

Upon setup of Mongo, there are no users or roles created and only local access is permitted.  This means that the web UI and direct connections can only be accessed from that machine, even though no login is required.  If the firewall is opened to allow remote connections, then security has to be added to prevent free-for-all access to your database.  The easiest way to do this is to create users assigned to the database and enable authentication in the configuration file; security is then enforced system-wide.  At that point the web UI will prompt you for a user name and password.  There was no indication of this in the Mongo documentation: none of the users that you added to your own database are accepted on this login screen.  Those users are only for direct connections when reading or writing to that database.  To enable access to the web UI, you need to create a user in the "admin" database, as sketched below.
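
A minimal sketch of creating such an admin user from the mongo shell on the server (the user name and password are placeholders, and authentication must also be enabled with auth = true in /etc/mongod.conf followed by a service restart):

  use admin
  db.createUser({
      user: "webAdmin",
      pwd: "CHANGE_ME",
      roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
  })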


Search

Search queries can be performed directly in Mongo via JSON query objects, which is sufficient for most scenarios (see the example below).  Mongo scales horizontally, so you can always add more servers to the farm for more storage and processing power, as well as load balancing and redundancy.  There are also justifications for an external search mechanism, leaving Mongo to be used solely as a data store.  It is simple enough to connect a search platform such as Apache Solr to index documents and collections from a Mongo database.  There is no out-of-the-box support for this, but custom solutions can be found on the Internet.
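
For example, a search expressed as a JSON criteria object from the mongo shell (the field values are illustrative):

  // find users in a given city, newest first, limited to 10 results
  db.users.find({ "address.city": "Springfield" }).sort({ _id: -1 }).limit(10)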


Conclusion

MongoDB is a nifty data storage solution.  It is free, scales horizontally (which is more flexible than vertical scaling), and can be set up in a very short amount of time.  The community and the support from fellow developers are top notch.  There are drawbacks compared to a traditional relational database, but the advantages of performance and simpler query structures far outweigh them.