Steven Zhao's Blog

Thursday, March 23, 2017

Sitecore: Creating Computed Fields in Sitecore 8

Background

Sitecore has a built-in Lucene search engine that indexes all the standard template fields and values. You can easily add new fields to the templates you create along the way. Adding a standard number or string field to the index is relatively straightforward. Just define an XML element in the

\App_Config\Include\Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config

under the

<fieldNames hint="raw:AddFieldByFieldName">

section and you are all set. But what about fields that need some manipulation before being indexed? For example, a content item could have a datetime field that holds a full timestamp when all you need is the month, to categorize a group of content items. We could pull out all the items and then programmatically extract the month from each one, but that is inefficient because so much processing has to be done at runtime. A better approach is to do all the processing at index time by creating computed fields.

Approach

Creating computed fields in Sitecore 8 is easier than in previous versions. You used to implement an interface and had to provide all of its properties and methods yourself. Starting with version 8, there is an abstract class to inherit from and only minimal implementation is needed. A barebones class for a computed field looks something like this:

using System;
using System.Collections.Generic;

using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Fields;
using Sitecore.Data.Items;

namespace MyProject.Utility.Search.ComputedFields
{
    public class TemplateIdAsString : AbstractComputedIndexField
    {
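        public override object ComputeFieldValue(IIndexable indexable)
        {
            // Only Sitecore items are handled; skip anything else the crawler passes in.
            var indexableItem = indexable as SitecoreIndexableItem;
            if (indexableItem == null)
                return null;

            Item item = indexableItem.Item;

            // Strip the dashes from the string form of the template ID.
            Sitecore.Data.ID templateId = item.TemplateID;
            if (!templateId.IsNull)
                return templateId.ToString().Replace("-", string.Empty); // further chained calls, if any, are truncated in the original listing

            return string.Empty;
        }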
    }
}

This computed field basically removes all the dashes from the templateID guid and converts to string format. All the data manipulation and processing is done inside the ComputeFieldValue method. When all work is done here, make sure you create a computed field definition in the config file above in the section

<fields hint="raw:AddComputedIndexField">

A simple example for the above field is:

<field fieldName="templateidasstring">MyProject.Utility.Search.ComputedFields.TemplateIdAsString,MyProject.Utility</field>

This basically sets all properties of the field to the default values. If you need to stray from the defaults, you can go back to the

<fieldNames hint="raw:AddFieldByFieldName">

section and define the field again there with custom field properties like this:

<field fieldName="templateidasstring" storageType="YES" indexType="TOKENIZED" vectorType="NO" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />


Rebuild the index from the Indexing Manager, then open it in Luke to verify that the computed field is now part of the index.

You have now created your first computed field in Sitecore. You can search on computed fields in a traditional Lucene search or via the ItemService API.

Friday, October 28, 2016

Coveo Indexes and Search Errors in Sitecore

Background

If you have worked with Coveo for Sitecore, you know that Coveo will create new indexes for the master, core, and web indexes along with any other custom indexes you have defined in your configuration files. Defining new Coveo indexes is very easy and almost, if not exactly, the same as defining Lucene indexes. What you may not know is that some hidden problems can arise from using your own custom Coveo indexes.

Problem

Defining a Coveo or Lucene index is simple. Drop an <index> element into //sitecore/contentSearch/configuration/indexes in the XML config files. Give it an id, assign a crawler (most likely the default SitecoreItemCrawler), specify the content database, and define the root, which is where the crawler begins to crawl. The crawler will crawl the root and all its descendants but not its ancestors. By default, all Coveo index definitions list the root as "/sitecore", the root item, so everything is crawled and indexed. But what if we don't care about all the noise, such as templates and layout items? What if we just want the content items pertaining to the website? You would set the root to something like "/sitecore/content/home" and this would yield the desired results. Your index shrinks considerably and only page-level website content items are included.

Now the problem arises when you try to perform a search in the Sitecore admin UI. There are two places to search: the search bar on top of the content tree in the Content Editor and the search bar in the Windows-style desktop taskbar. Searching in the Content Editor yields the desired results, but searching in the taskbar produces an error. You will not get any results even though the index is fully functional. The log files reveal that no index contains the root "/sitecore" content item. Apparently, the taskbar search looks through all your indexes and at least one of them has to contain the root item. But adding the root item would entail adding all the noise back into all your indexes, which is not an acceptable solution.

Solution

A simple solution in this scenario is to create your own crawler. The crawler can be derived from the default SitecoreItemCrawler, so it functions the same way for the most part. It also has a boolean IsExcludedFromIndex method that you can override to determine what is excluded. By doing this we can use the exact same XML structure to define the content database and root item. This is what the crawler would look like:

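using Sitecore.ContentSearch;

public class SingleItemCrawler : SitecoreItemCrawler
{
   protected override bool IsExcludedFromIndex(SitecoreIndexableItem indexable, bool checkLocation = false)
   {
      // Exclude every item whose path differs from the configured root; the root itself still goes through the base check.
      return indexable.AbsolutePath != RootItem.Paths.ContentPath || base.IsExcludedFromIndex(indexable);
   }
}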

A one-line method solves the problem. The overridden IsExcludedFromIndex method compares the item being crawled to the root item defined in the XML and excludes the item if they don't match. This ensures the result set is a single item. Use this single-item crawler in conjunction with the default crawler for the page-level content items and you will get an index with all the desired page-level items plus the required root item of the content tree.

Tuesday, July 5, 2016

Install Apache Nutch 2.3.1 On Linux

Introduction

Apache Nutch is a web scraper. It takes a list of seed URLs, discovers further relevant URLs from them, and then parses the content of each web page by stripping the HTML tags. It is one of the most established open source scrapers available today. This guide is written for version 2.3.1, which is the latest version as of now, installed on a CentOS 7.0 Linux server.

Requirements

The major difference between Nutch 1 and Nutch 2 is that Nutch 2 stores all results in a data store. The default data store is Apache HBase, but you can also use MongoDB. Since Nutch 2.3.1 is currently distributed as source code only, with no binary release yet, you will need to compile the code locally after adjusting all the configuration parameters. The build tool of choice is Apache Ant.

Install Java

Using yum, install the latest version of Java if you don't already have it installed.

sudo -i
yum install java-1.8.0-openjdk.x86_64

Install Apache HBase

cd /tmp
wget http://archive.apache.org/dist/hbase/hbase-0.98.8/hbase-0.98.8-hadoop2-bin.tar.gz
cd /usr/share
tar zxf /tmp/hbase-0.98.8-hadoop2-bin.tar.gz

Edit: /usr/share/hbase-0.98.8-hadoop2/conf/hbase-site.xml

Insert this block inside the <configuration> element to set up the storage location:

<property>
<name>hbase.rootdir</name>
<value>/usr/share/hbase-0.98.8-hadoop2/data/</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>

Commands:

/usr/share/hbase-0.98.8-hadoop2/bin/start-hbase.sh
/usr/share/hbase-0.98.8-hadoop2/bin/hbase shell

At this point, if you are taken to the HBase prompt, you have installed HBase successfully. To further test, try listing existing tables and creating new ones.

> list
> create 'test','cf'

Install Apache Ant

yum install ant

Install Nutch

cd /tmp
wget http://apache.mesi.com.ar/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
cd /usr/share
tar zxf /tmp/apache-nutch-2.3.1-src.tar.gz

Edit and add to: /usr/share/apache-nutch-2.3.1/conf/nutch-site.xml

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

In the same file, edit the cluster name to match the name in the Elasticsearch config.

Edit /usr/share/apache-nutch-2.3.1/ivy/ivy.xml
Ensure the following lines are there and uncommented:

<dependency org="org.apache.gora" name="gora-core" rev="0.6.1" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
<dependency org="org.apache.hbase" name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />

Edit /usr/share/apache-nutch-2.3.1/conf/gora.properties
Ensure this is there:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Compile Nutch from source code:

cd /usr/share/apache-nutch-2.3.1
ant runtime

If the build fails, clean up, correct the configs, and build again:

ant clean
ant runtime

If the build is successful, a compiled runtime folder is created.

Nutch Commands

The following commands run through a simple scenario consisting of a web crawl followed by indexing to Elasticsearch (ES).

Make sure you run these inside the runtime/local folder.

mkdir urls
echo "https://en.wikipedia.org" > urls/seed.txt
bin/nutch inject urls/seed.txt
bin/nutch generate -topN 40
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch index -all

The index command sends the data over to ES. It will not work until ES is installed and running. A different, but similar, command is used for another indexer such as Solr.

Automation

The commands above can be included in a script file (/scripts/scrape). Set up a cron job to run the script at the desired time interval.
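As a rough sketch of what such a script could contain, assuming the install path used earlier in this guide. The names NUTCH_HOME, NUTCH_BIN, run_crawl_cycle and the --run flag are made up for this example and are not part of Nutch:

```shell
#!/usr/bin/env bash
# Hypothetical /scripts/scrape: one crawl cycle built from the commands listed above.
# NUTCH_HOME, NUTCH_BIN, run_crawl_cycle and --run are illustrative names only.
NUTCH_HOME="${NUTCH_HOME:-/usr/share/apache-nutch-2.3.1/runtime/local}"
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

run_crawl_cycle() {
  top_n="${1:-40}"
  # Run in a subshell so the caller's working directory is left alone.
  (
    cd "$NUTCH_HOME" || exit 1
    # Each step runs only if the previous one succeeded.
    "$NUTCH_BIN" inject urls/seed.txt &&
    "$NUTCH_BIN" generate -topN "$top_n" &&
    "$NUTCH_BIN" fetch -all &&
    "$NUTCH_BIN" parse -all &&
    "$NUTCH_BIN" updatedb -all &&
    "$NUTCH_BIN" index -all
  )
}

# Only crawl when explicitly asked to, e.g. from cron: /scripts/scrape --run
if [ "${1:-}" = "--run" ]; then
  run_crawl_cycle 40
fi
```

A crontab entry along the lines of `*/30 * * * * /scripts/scrape --run >> /var/log/scrape.log 2>&1` would then run a cycle every 30 minutes. The interval and the log path are assumptions, so adjust them to your own setup.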

Install Elasticsearch

The version of ES that works in this whole setup is 1.7.3. Please refer to the guide here for installing ES:

http://mrstevenzhao.blogspot.com/2016/06/elasticsearch-install-on-linux.html

Wednesday, June 29, 2016

Elasticsearch Install on Linux

Introduction

Elasticsearch (ES) is a powerful search platform that is very easy to use. Based on the trusted and beloved Lucene platform, ES offers everything Lucene offers and more, such as replication and sharding for horizontal scaling and faceted filtering (aggregations) of results. You can also get an instance up and running in less time than with Solr, which is one of its biggest advantages over Solr. All you have to do is install, add documents, and then query for results. That's all it takes to get started. Of course there is much more to it if you want to get the most out of the platform, but you can have a simple system up and running in no time.

Installation

sudo -i  # switch to the root user
yum install java-1.8.0-openjdk.x86_64  # latest Java
java -version  # verify
wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.3.3/elasticsearch-2.3.3.rpm  # download
rpm -ivh elasticsearch-2.3.3.rpm  # install the file that was just downloaded

Elasticsearch is now installed in:
/usr/share/elasticsearch/

Configuration files are in:
/etc/elasticsearch

Init script in:
/etc/init.d/elasticsearch

Execution Commands:
service elasticsearch start
service elasticsearch stop
service elasticsearch restart

Configuration

Two configuration files are located in: /etc/elasticsearch

elasticsearch.yml - Everything except logging
logging.yml - Logging (by default all logs are in: /var/log/elasticsearch)

For the most part, all your config changes should take place in the elasticsearch.yml file. This is where you can determine if everything should run locally only or could be bound to an external IP address. In a production environment you would choose the localhost option. But if you do have to bind to an external IP for the purpose of testing and visualizing your data, please make sure these config line items are in there:

network.host: localhost
network.bind_host: 0.0.0.0
http.port: 9200

In your firewall settings, please open up port 9200 to all incoming traffic. This is the default port for ES.
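A minimal sketch of opening the port, assuming the server uses firewalld (the default on CentOS 7). If your server uses plain iptables or a cloud security group instead, the steps will differ. The open_tcp_port function and the FIREWALL_CMD variable are names invented for this example:

```shell
#!/usr/bin/env bash
# Sketch only: assumes firewalld is the active firewall.
# FIREWALL_CMD and open_tcp_port are illustrative names, not part of firewalld.
FIREWALL_CMD="${FIREWALL_CMD:-firewall-cmd}"

open_tcp_port() {
  port="$1"
  # Add a permanent rule for the port, then reload so it takes effect.
  "$FIREWALL_CMD" --permanent --add-port="${port}/tcp" &&
  "$FIREWALL_CMD" --reload
}

# Example (run as root): open_tcp_port 9200
```

The same function would cover the other ports mentioned in these posts, but only expose a port like this on a test machine, since ES itself has no login in front of it.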

Usage

ES does not officially have a web UI that displays system status and health the way Solr does. ES performs all its tasks purely through a RESTful API. Therefore if you point your browser to:

http://IP:9200

All you will see is a JSON response object that should tell you basic info such as cluster and node info if the installation was successful. Some other URLs you can try are:

http://IP:9200/
http://IP:9200/_nodes
http://IP:9200/_cluster/health
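The same check can be scripted. The sketch below pulls the "status" value (green, yellow, or red) out of the _cluster/health response with sed rather than a real JSON parser, which is good enough for a quick check. ES_URL, extract_status, and cluster_status are names chosen for this example:

```shell
#!/usr/bin/env bash
# Sketch: read the cluster status from the _cluster/health endpoint.
# ES_URL, extract_status and cluster_status are illustrative names.
ES_URL="${ES_URL:-http://localhost:9200}"

extract_status() {
  # Reads JSON on stdin and prints the value of its "status" field.
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

cluster_status() {
  # -s keeps curl quiet so only the JSON reaches the filter.
  curl -s "$ES_URL/_cluster/health" | extract_status
}

# Example: cluster_status
```

A single-node install will typically report yellow once it holds an index with replicas, because there is no second node to host the replica shards.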

GUI Tools

Having a tool to visualize your data would be great. There are tools available and they all make RESTful calls to ES and parse the resulting JSON response into a more pleasant visual experience that you can see on your screen. One such tool is the Head Plugin. The Head Plugin allows for visual display of system status instead of JSON data and performs search queries.

To install, issue the following commands:

cd /usr/share
elasticsearch/bin/plugin install mobz/elasticsearch-head

To view, point your browser here:
http://IP:9200/_plugin/head/

With this tool, you can see all your clusters, nodes, and shards, as well as issue search commands and much more.

Another option is the Kibana App. This app also allows for visualizations but can render results in graphs and charts for even more detailed analyses.

To install, please issue the following commands:

sudo -i
rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
nano /etc/yum.repos.d/kibana.repo

The rpm command downloads and installs the public signing key, and the nano command creates the repo file. Copy, paste, save, exit:

[kibana-4.5]
name=Kibana repository for 4.5.x packages
baseurl=http://packages.elastic.co/kibana/4.5/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

Then install and start Kibana:

yum install kibana
service kibana start

To view, please point your browser here:

http://IP:5601/

Again, please make sure your firewall inbound rules allow TCP connections to port 5601.

The default config file should allow for external IP connections but if you want to disable that in the production environment, do it here and change 0.0.0.0 to localhost:

/opt/kibana/config/kibana.yml

Mappings and Schemas

A major difference between ES and Solr is that in Solr, the index schema has to be defined before you can add documents to the index. Field types such as string or integer, and whether or not a field is analysed, have to be defined right at the beginning. With ES, you can start adding documents to an index right away. If the index does not exist, it will be created. ES has mappings rather than schemas. A mapping defines what the fields are and what their datatypes are. If a mapping is not defined, ES guesses each field's datatype from the first document in which that field appears. The only complication with this approach arises if your data changes after that first document was added. If you keep your data consistent then there is no reason to worry. Even if you change your mind in the future, it is easy enough to define a new mapping in a new index and re-index everything from the old index into the new one. This is necessary because once a field is defined in a mapping, you cannot change its datatype.

Sample Commands

Add a new document and show response formatted neatly:

curl -X PUT 'localhost:9200/tutorial/helloworld/1?pretty' -d '
    {
      "message": "Hello People!"
    }'

This document will be stored in the "tutorial" index of type "helloworld"

Retrieve a document and show response formatted neatly:

curl -X GET 'localhost:9200/tutorial/helloworld/1?pretty'

Search for a term in specific fields:

curl -X GET "http://localhost:9200/_search?pretty" -d'
{
    "query": {
        "query_string": {
            "query": "hello",
            "fields": ["id","message"]
        }
    }
}'

Retrieve a mapping:

curl -X GET 'http://localhost:9200/_mapping?pretty'

Add a new field to existing mapping:

curl -X PUT 'http://localhost:9200/tutorial/helloworld/_mapping' -d '
{
    "helloworld" : {
        "properties" : {
            "message2" : {"type" : "string", "store" : "true"}
        }
    }
}'

Re-index documents from one index to another without retaining versions:

    curl -X POST 'localhost:9200/tutorial2' -d '{
     "mappings" : {
      "helloworld" : {
       "properties" : {
        "message" : { "type" : "string", "store": "true" }
       }
      }
     }
    }'

    curl -X POST 'localhost:9200/_reindex' -d '{
     "source" : {
      "index" : "tutorial"
     },
     "dest" : {
      "index" : "tutorial2",
      "version_type": "internal"
     }
    }'

Summary

We have just scratched the surface of what ES can do. ES is great if time is crucial and minimal time is allocated for setup. A common pitfall is that, because ES is so easy to set up, the novice developer can run into long-term issues and end up spending time re-configuring the indexes via mapping updates and re-indexing. Solr ensures that there are no data inconsistencies because you have to know in advance what the data is supposed to look like. ES allows for changes, and changes can create problems with searching. A dynamic mapping is created when the first document is inserted. You can vary datatypes as often as you like when adding documents afterwards, but errors will surface when the engine checks the mappings and finds documents with fields that are inconsistent with the definitions there. There are also other reasons to change mapping definitions besides changing your mind about field datatypes. A good one is to index fields in a different way, for example to allow case-insensitive searching or partial matches in strings.

Friday, May 13, 2016

Apache Solr Install w/ MongoDB Indexing

Introduction

Solr is an awesome search platform. Based on the trusted and beloved Lucene platform, Solr offers everything Lucene offers and more, such as replication and sharding for horizontal scaling, faceted filtering of results, and multi-core separation. SolrCloud is a Solr offering with sharding in place for real-time distributed reading and writing across a farm of servers designed to handle extra storage and traffic. Traditional Solr setups can also be distributed, but only in the form of replication, where multiple slave nodes periodically pull from the master index and all search queries are performed on the slaves. This is pull replication, whereas SolrCloud implements real-time push replication.

Solr vs ElasticSearch (ES)

Both are Java-based and both are extended forms of Lucene so what sets them apart? For the most part they are very similar with a few differences. For the average business case with a need for a highly-scalable search platform, either one will be fine. These are some of the differences between both platforms.

ES is a newer platform, which gives it a more modern feel, and distributed operation is handled out of the box with minimal configuration and setup. SolrCloud requires a separate coordination service, Apache ZooKeeper, to keep all the shards in sync. This means additional steps compared to ES, whose built-in Zen Discovery module is part of the install and makes a distributed setup possible with very little effort. With this in mind, you could say Solr setups are for the more serious developers, since more in-depth knowledge is required to get things up and running. Also, ES setups are known to have long-term problems: because the initial setup is not very demanding and "anyone can get it up and running", the people running it may lack the knowledge to maintain it.
Solr is an older product by a few years and the community reflects that. The community is bigger and more established with more resources available on the web to address issues that you may encounter. Solr also has better official documentation.
ES has a more robust analytics suite. This data proves useful for marketing purposes.

Steps To Install Solr 6.0.0 in Linux

yum -y update
Ensure everything is up to date first
java -version
Check java version
yum list available java*
Check all available java versions in the YUM package manager
yum install java-1.8.0-openjdk.x86_64
Install the latest version of java if not installed already
cd /tmp
Change the working directory to temp to prepare for download
wget http://apache.org/dist/lucene/solr/6.0.0/solr-6.0.0.tgz
Download the solr install
tar -zxvf solr-6.0.0.tgz
Uncompress file
cd solr-6.0.0
bin/install_solr_service.sh /tmp/solr-6.0.0.tgz
Run the script to install solr

The Solr service is now installed. The installation dir is /opt/solr and the data dir is /var/solr.

The script also creates a solr user and group, which are used to create Solr cores.
View admin web UI here:

http://localhost:8983/solr/

or

http://IPADDRESS:8983/solr/ (remotely)

This may require updating the firewall for public access and opening up the default port 8983.

Create Sample Core and Load Documents

sudo chown -R solr:solr /var/solr/
Make solr user the owner of dir
cd /opt/solr
su - solr
Switch to solr user
bin/solr create -c documents
Create documents core, if permissions are correct you will not get errors
bin/post -c documents docs
Load up the core with the test html docs, do not use if not testing

Operations

All configuration data points will reside in the folder of the core in the file system. In our case, the path is:

/var/solr/CORE_NAME/conf

All changes to the config files will only be reflected in the system after restarting. These are the commands:

service solr start
service solr stop
service solr restart

Data Import Handler and MongoDB

The example above is a simple scenario where static files in the same file system are pulled in and indexed. In a real environment, you may have to index data records in a relational database or even NoSQL databases such as MongoDB. All this is done using the DataImportHandler. There is support for indexing remote files and database records along with other sources but there is no native support for indexing Mongo collections and documents. At this time a custom solution is needed. You will need to download the latest versions of the following JAR files:

solr-dataimporthandler-x.x.x.jar
solr-mongo-importer-x.x.x.jar
mongo-java-driver-x.x.jar

These files have to be dropped into:

/opt/solr/dist

The core configuration files are at: /var/solr/CORE_NAME/conf
This folder holds the XML files that define the system behavior. You have to make the system aware of the JAR files that were dropped in by updating the solrconfig.xml file. Please add these alongside the other lib declarations:

<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-mongo-importer-.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="mongo-java-driver-.*\.jar" />

The schema file is managed-schema (in Solr 6 it has no .xml extension). This file is where all field names and definitions are stored. Please add the custom fields for the indexes here. They will look something like this:

<field name="firstName" type="string" indexed="true" stored="true"/>
<field name="email" type="string" indexed="true" stored="true"/>

etc...

In the solrconfig.xml file you will need to indicate that a DataImportHandler will be used to handle external data imports. Please add:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">  
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>

In the same conf directory, create a new data-config.xml file and put in:

<dataConfig>
  <dataSource name="MongoSource" type="MongoDataSource" host="localhost" port="27017" database="dbname" username="userid" password="password"/>
  <document name="import">
    <entity processor="MongoEntityProcessor" datasource="MongoSource" transformer="MongoMapperTransformer" name="Users" collection="users" query="">
      <field column="_id" name="id" mongoField="_id"/>
      <field column="email" name="email" mongoField="email"/>
      <field column="firstName" name="firstName" mongoField="firstName"/>
      <field column="lastName" name="lastName" mongoField="lastName"/>
      <entity name="Address" processor="MongoEntityProcessor" query="{'email':'${Users.email}'}" collection="users" datasource="MongoSource" transformer="script:addressDataTransformer">
      </entity>
      <entity name="Collection" processor="MongoEntityProcessor" query="{'email':'${Users.email}'}" collection="users" datasource="MongoSource" transformer="script:userCollectionDataTransformer">
      </entity>
    </entity>
    <entity processor="MongoEntityProcessor" datasource="MongoSource" transformer="MongoMapperTransformer" name="Jobs" collection="jobs" query="">
      <field column="_id" name="id" mongoField="_id"/>
      <field column="location" name="location" mongoField="location"/>
      <field column="title" name="title" mongoField="title"/>
      <field column="description" name="description" mongoField="description"/>
      <entity name="Collection" processor="MongoEntityProcessor" query="" collection="jobs" datasource="MongoSource" transformer="script:jobCollectionDataTransformer">
      </entity>
    </entity>
  </document>
  <script><![CDATA[
    // Copies the nested address fields of a user document into flat index fields.
    function addressDataTransformer(row) {
      var ret = row;
      if (row.get("address") !== null) {
        var address = row.get("address");
        if (address.get("address1") !== null) { ret.put("address1", address.get("address1").toString()); }
        if (address.get("address2") !== null) { ret.put("address2", address.get("address2").toString()); }
        if (address.get("city") !== null) { ret.put("city", address.get("city").toString()); }
        if (address.get("state") !== null) { ret.put("state", address.get("state").toString()); }
        if (address.get("zip") !== null) { ret.put("zip", address.get("zip").toString()); }
      }
      return ret;
    }

    // Adds a static "collection" value so user documents can be told apart in the index.
    function userCollectionDataTransformer(row) {
      var ret = row;
      ret.put("collection", "users");
      return ret;
    }

    // Same as above, for job documents.
    function jobCollectionDataTransformer(row) {
      var ret = row;
      ret.put("collection", "jobs");
      return ret;
    }
  ]]></script>
</dataConfig>

This example is a complex one. It highlights flat one-to-one fields from a Mongo document mapped to an index field. It also highlights the use of adding in static field values and using a custom data transformer to map nested Mongo document fields to index fields.

At this point the DataImportHandler should be set up. If you load the web UI and the Data Import section does not display an error message, then there are no errors in the configuration files. The only thing needed now is to tweak the schema field values and configuration data points to customize system behavior and performance.

Automating Data Import

Now that the DataImportHandler is set up to import correctly, the final step is to schedule it to run periodically. The simplest way to do this is to hit the URL which triggers a full or delta import like this:

http://localhost:8983/solr/CORE_NAME/dataimport?command=full-import

In a Linux environment, you can create a cron job that uses curl to request this URL at a scheduled interval. The Data Import section of the web UI also shows the last time the index was updated, whether that was done via the UI or via a URL request.
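As a small sketch of how that could look, assuming Solr runs on the default port on the same machine. SOLR_BASE, dataimport_url, and trigger_import are names invented for this example:

```shell
#!/usr/bin/env bash
# Sketch: build and request the DataImportHandler URL shown above.
# SOLR_BASE, dataimport_url and trigger_import are illustrative names.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

dataimport_url() {
  core="$1"
  cmd="${2:-full-import}"   # full-import or delta-import
  printf '%s/%s/dataimport?command=%s\n' "$SOLR_BASE" "$core" "$cmd"
}

trigger_import() {
  # -s silences progress output, -f makes curl exit non-zero on an HTTP error.
  curl -sf "$(dataimport_url "$1" "$2")" > /dev/null
}

# Example: trigger_import CORE_NAME full-import
```

For a cron job you can also skip the script and put the curl call straight into the crontab, for instance `0 * * * * curl -sf 'http://localhost:8983/solr/CORE_NAME/dataimport?command=full-import' > /dev/null` for an hourly import. The hourly schedule is only an example.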

Summary

Solr is a very powerful tool and the advantage of using Solr with MongoDB is that you are separating the data store from the search engine. The data store can focus on collecting data and the search engine can focus on the queries and indexing. You can also scale each component separately, which is essential for data and resource redundancy.

MongoDB Installation On A Linux Server

Introduction

The following instructions are for the latest version of MongoDB (2.6.12) on the Linux server of choice (CentOS 7 64-bit). CentOS is essentially the free version of Red Hat Enterprise Linux and offers all the features minus the corporate support. Coming from a Windows development background, I realize this entry is a little off the main course of my usual tips and tech posts, but it will be useful nonetheless since Mongo is traditionally hosted in a Linux environment. Traditional Microsoft stack applications are starting to have LAMP pieces added to the mix. Besides, in a production environment, all of this is license free even if the Microsoft pieces are not.

Steps

sudo -i
This will get you in administrator mode to remove all restrictions moving forward.
nano /etc/yum.repos.d/mongodb.repo
This opens the yum repo file in the NANO editor, which has an easy-to-understand "interface".
Paste in (for 64-bit servers, different for 32-bit)

[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1

This stores the location of the install file to prepare for the YUM package manager to download and install.
Save, exit nano
yum install mongo-10gen mongo-10gen-server
At this point the mongo service is installed and ready to use
service mongod start
This starts the service.

service mongod stop
This stops the service.

service mongod restart

You will need to restart the service if configuration items are changed. The configuration file is here:

/etc/mongod.conf
To autostart:
chkconfig mongod on

Management

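The Mongo installation on the server comes with a simple web user interface. It is served by the mongod process itself, so no separate web server such as Apache is needed. Depending on the version, the HTTP interface may first need to be enabled in the configuration file. The web UI can be accessed here from within the same machine: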

http://localhost:28017

To access remotely, you would need to perform some systems administration on the firewall to open up the IP address to the public along with both ports 27017 and 28017 (direct connection and web connection default ports). The web UI would then be accessed here:

http://IPADDRESS:28017

The web UI is nothing more than a status and health page. It is a good indicator that Mongo is up and running and it provides setup and status metrics, but nothing more. To view collections and documents, I recommend standalone apps such as Robomongo. To connect directly to the database, you need to provide the host and port number at the very minimum. If Robomongo is installed on the same machine, the host is localhost and the port is 27017; if you are connecting remotely, use the server's IP address as the host. The ports as well as other configuration data points can be changed via the configuration file.

Comparison of Data Storage

There is a wealth of information on the Internet about adding databases and collections. A collection can be compared to a table in a relational database. An item in a collection is a document and can be compared to a data record in a relational database. A main difference between a document and a data record is that data records all have the same fields, while documents in the same collection do not have to. They are just JSON objects that can have different fields and structures. Since they are JSON objects, they can also have nested fields, such as an address object within a user profile object. In the relational database model, the address data records would be in a separate table and related to the user record via an address id number. A big advantage of NoSQL over RDBMS is that fewer queries and JOINs are required to obtain the same information. In the address example only one "SELECT" query is needed for Mongo vs a SELECT and a JOIN in MSSQL. A big drawback of nested objects in Mongo is that indexing them becomes more complex in search platforms such as Solr, since search engines traditionally expect flat objects coming in to be indexed. The resulting indexed documents can be multi-dimensional, but that comes from data transformation during the indexing process and not directly from the incoming objects.

Security

Upon setup of Mongo, no users or roles are created and only local access is permitted. This means that the web UI and direct connections can only be used from that machine, even though a login is not required. If the firewall is opened to allow remote connections, then security has to be added to prevent free-for-all access to your database. The easiest way to do this is to create new users assigned to the database. Adding just one user turns security on system-wide. At this point the web UI will prompt you for a user name and password. There was no indication of this in the Mongo documentation. None of the users that you added to your custom database are accepted on this login screen. Those users are only for direct connections when reading or writing to that database. To enable access to the web UI, you need to create a user in the "admin" database.

Search

Search queries can be performed directly in Mongo via JSON request objects. This is sufficient for most scenarios. Mongo scales horizontally, so you can always add more servers to the farm for more storage and processing power as well as load balancing and redundancy. There are also cases that justify an external search mechanism, with Mongo used solely as a data store. It is simple enough to connect search platforms such as Apache SOLR to index documents and collections from a Mongo database. There is no out-of-the-box support for this, but custom solutions can be found on the Internet.

Conclusion

MongoDB is a nifty data storage solution. It is free, scales horizontally, which is more flexible than vertical scaling, and can be set up in a very short amount of time. The community and the support from fellow developers are top notch. Although there are drawbacks compared to a traditional relational database, the advantages of performance and simpler query structures far outweigh them.

Thursday, August 20, 2015

Sitecore: Multisite Setup with Multilingual

This is a followup blog entry to the custom language fallback strategy located here:

http://mrstevenzhao.blogspot.com/2015/08/sitecore-custom-language-fallback.html

Background

Clients usually go the route of a single Sitecore instance with a multi-site setup to cut costs and maximize the reusability of components. This approach usually complicates development but there are loads of advantages as well, such as writing less code and less copy-and-pasting.

Scenario

A very realistic scenario in the corporate world is a large parent company with many different brands. All the brands will have their own individual sites. In the Sitecore world, each of these brand websites can be an individual site in the content tree. But what about when each of these brands has multiple international versions? It would be easy if we could just use the "sc_lang" parameter to switch languages, but what if they need to be hosted under different domain name suffixes? For example:

www.domain.com - main site
www.domain.fr - main site with French content
www.domain.com.au - main site with Australian English content

In this scenario, we don't want to create three website nodes in the content tree. That would defeat the purpose of language fallback and content reusability. The only way is to ensure all three sites read from the same node and, depending on the hostname suffix, choose the corresponding language content.

Setup

To start, we must ensure that there is a place to associate different languages with different domain name suffixes. To do this, you can modify the system language template to include a new field named "Domain Name Suffix".

As you may have guessed, we will eventually have to query all the system languages for the value of the Domain Name Suffix field. We COULD iterate item by item in this folder, but a better way is to create a custom Lucene index that contains all these items with their fields and values. This part is up to you to create.

As for defining the sites, you can either do it the out-of-the-box way and add to the <site> entries in the web.config file or you can do it the way I did it with dynamic configurations without having to modify any config files.

http://mrstevenzhao.blogspot.com/2014/04/sitecore-multi-site-setup-wo-updating.html

Solution

Eventually we will need a way to route the hostname in the browser to the correct context website with the corresponding context language. This is a two-step process:

1) Find the correct "parent" site, usually the ".com" version.
2) Find the content item for the language according to the domain name suffix.

Step one can be done in the SiteResolver processor of the httpRequestBegin pipeline. It is best to create a new custom version of this class or a derived version of the default one. In a nutshell, you would have to:

a) Check the URL hostname and get the value of the name without the suffix.
b) Look through all the website node names without suffixes in the content tree and try to find a match with the value from step a.
c) If a node is found, then you have found the context site.

To enhance performance you can utilize HttpContext caching and custom Lucene indexes to store all website nodes so you can just query against an index instead of iterating through items in the tree.

Step two can be done by creating a custom LanguageResolver that is derived from the default LanguageResolver. Once you set the context language, the ItemProvider that actually gets the language version of the content items will do everything automatically. Basically, here are the steps for the custom LanguageResolver:

a) Check the URL for the language parameter "sc_lang" to see if the language is already being set manually. If so, then we just let the default behavior take place.
b) If not, then we check the domain name suffix. If the domain suffix matches any of the system languages on the value of the field "Domain Name Suffix", then we have found the matching language. Set the context language to be that language using the ISO code.
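The decision logic of these two steps can be sketched as a small, Sitecore-independent C# function. The SuffixLanguageResolver class is hypothetical, and the suffix-to-language dictionary is assumed to have been built from the "Domain Name Suffix" field of the system languages (for example from the custom index). A derived LanguageResolver would call something like this and set the context language only when a non-null ISO code comes back.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the Sitecore API.
public static class SuffixLanguageResolver
{
    // Returns the ISO code to use as the context language,
    // or null when the default behavior should take place.
    public static string ResolveLanguage(
        string host, string scLangQueryValue, IDictionary<string, string> suffixToLanguage)
    {
        // Step a: "sc_lang" was set manually, so do not override it.
        if (!string.IsNullOrEmpty(scLangQueryValue))
        {
            return null;
        }

        // Step b: find the longest mapped suffix that the host ends with.
        string match = suffixToLanguage.Keys
            .Where(s => host.EndsWith("." + s, StringComparison.OrdinalIgnoreCase))
            .OrderByDescending(s => s.Length)
            .FirstOrDefault();

        return match == null ? null : suffixToLanguage[match];
    }
}
```

The ISO codes in the mapping, such as "en-AU" for "com.au", are illustrative values only; the real ones depend on the languages defined in your instance.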

Summary

This is the high-level implementation of the strategy. We do not want to create site nodes for every single language of a domain so we have to check that the current hostname in the browser matches the "main" site hostname in the content tree. We then take the suffix and determine which system language is mapped to that suffix and set the context language.