
eZ Barcamp, Lyon: advanced searching and navigation topic January 17, 2007

Posted by Paul Borgermans in eZ publish, Lucene, Searching, Solr.

During the international partner meeting of eZ systems in Lyon (January 24-26), there will also be a Barcamp. I will use this occasion to talk about a new advanced search and navigation engine for eZ publish that is in the works.

This search engine goes by the name Aurora, as it builds on the Apache Solr (pronounced solar) incubator project and, well, eZ publish's home is in Norway, where auroras can be seen more often than here in Belgium 🙂

In fact, the Aurora plugin/engine is a follow-up to the Lucene-based plugin Kristof and I developed some time ago. The features I wanted to add, like faceted search/navigation, keyword highlighting, high-performance caching and more, are all built into the Solr backend, which is itself based on the Lucene Java IR libraries. So instead of writing this myself with Lucene and the PHP-Java bridge, I can concentrate more on fundamental aspects. Also, there is no need to install a PHP-Java bridge extension: the Solr backend is used over an HTTP connection. I think this is good news for all those who complained about that aspect (but you still need Java 1.5, aka J2SE 5.0, installed).
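To make the "Solr over HTTP" point concrete, here is a minimal sketch of how a client could assemble a search request with faceting and highlighting switched on. It is written in Python only for brevity and is not the Aurora code. The base URL is an assumption (a local Solr on its default port), and `build_solr_url` is a hypothetical helper name; `q`, `rows`, `facet`, `facet.field` and `hl` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed base URL: a local Solr instance on its default port; adjust to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_solr_url(query, facet_fields=(), highlight=False, rows=10):
    """Return the URL of a Solr select request (hypothetical helper)."""
    params = [("q", query), ("rows", str(rows))]
    if facet_fields:
        # Faceting is enabled once, then one facet.field parameter per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    if highlight:
        params.append(("hl", "true"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_solr_url("ez publish", facet_fields=["class", "section"], highlight=True)
```

Fetching that URL with any HTTP client is all that is needed on the PHP side, which is why no bridge extension is required.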

The effort for this new search plugin also has a few benefits for the larger PHP world:

  • I created a new Solr response writer in Java for PHP: no XML result parsing is necessary, as the results are returned as a string which can be eval'ed as a multidimensional PHP array (so PHP now joins the Ruby, Python, XSLT, XML and JSON response writers already available). The code is not in Solr yet, but will be in the coming weeks.
  • A core PHP utility class/library for Solr is in the works which will form the basis for a component in the eZ components PHP library (if the eZ team accepts this of course).
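The principle behind such a response writer can be sketched with the Python flavour of the same idea: the server returns text that is already valid literal syntax in the client language, so the client evaluates it instead of parsing XML. The sample string below is hand-written for illustration and is not real Solr output, and `parse_response` is a hypothetical helper. The sketch uses `ast.literal_eval`, which only accepts literals and is therefore safer than a plain `eval`.

```python
import ast

# Hand-written sample in Python literal syntax (an assumption, not real Solr output).
SAMPLE_RESPONSE = (
    "{'responseHeader': {'status': 0, 'QTime': 3}, "
    "'response': {'numFound': 2, 'start': 0, 'docs': ["
    "{'id': '42', 'name': 'Aurora'}, "
    "{'id': '43', 'name': 'Lucene plugin'}]}}"
)


def parse_response(text):
    """Turn a literal-syntax response string into nested dicts and lists."""
    data = ast.literal_eval(text)  # accepts literals only, never runs code
    if not isinstance(data, dict):
        raise ValueError("expected a dictionary at the top level")
    return data


result = parse_response(SAMPLE_RESPONSE)
names = [doc["name"] for doc in result["response"]["docs"]]
```

The PHP version works the same way in spirit, with the string evaluating to a multidimensional array.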

And a note to the PHP lovers who do not like Java: the object/class persistence and caching for Java web applications (like Solr, which runs inside a servlet container) has no counterpart in the PHP world. The speed is simply amazing.

Cheers!

Second release of our Lucene based search plugin for eZ publish July 12, 2006

Posted by Paul Borgermans in eZ publish, Lucene.

… or an open source ECMS meets an open source enterprise level search engine ;-).

If you are using eZ publish and are in control of the servers it runs on, please test our contribution (beta) by downloading:

http://ez.no/community/contribs/applications/lucene_java_search_plugin

Though you need to install the PHP-Java bridge as an additional PHP extension, it is worth the trouble if you need a good search engine. We already use it on production sites. One experimental feature added is a "more like this" module/view. You can use it in node/view templates as

<a href={concat("/lucene/similar/",$node.node_id)|ezurl}>Show similar objects</a>

For comments, requests or bug reports, use the comments feature here or on ez.no.

In the future, we plan to add quite a few more exciting features, like image search by example.
Happy searching!

Paul

Some eZ publish summer conference pictures July 7, 2006

Posted by Paul Borgermans in eZ publish, Lucene.

Trying out Picasa Web Albums: not bad and pretty fast 🙂

Here is a selection of my eZ publish summer conference pictures:

http://picasaweb.google.com/Paul.Borgermans/20060623Ezsummer

FYI, Flickr also contains quite a few pictures of this magnificent event/conference:

http://www.flickr.com/photos/tags/ezconference2006/
Enjoy!

–paul

Prototyping lucene based search in eZ publish: first results June 2, 2006

Posted by Paul Borgermans in eZ publish, Lucene.

After roughly 12 hours of coding (including removing some dust from my Java skills), Kristof and I reached a first milestone in our Lucene project: full text search implemented as a normal search plug-in for eZ publish. First impressions in a nutshell: fast, accurate and even faster 🙂

Even though we use the PHP-Java bridge in a non-optimal development mode with default Java settings (like low memory), indexing and searching are way faster than with the ezsearch plugin. Accurate benchmarks are still lacking (we'll do them later), but it appears to be at least an order of magnitude faster. A typical search over an index with ~4000 documents takes 0.050 secs; including the step to fetch the content objects from the eZ publish database for the displayed hits (10 at a time), this increases to typically 0.1 secs (machine: dual 3 GHz 64-bit CPU, 4 GB RAM).

Some technical details on the milestone reached:

  • the plugin is written in php, and calling the lucene classes and methods (Java version) is implemented through the php-java bridge
  • object attributes are indexed as separate fields, as well as object meta-data (owner, dates, section, class, path…)
  • for lucene users: the analyzer used is the multi-field analyzer
  • full text search is over all fields (our test case: 137 fields)
  • all the typical richness of lucene queries (Boolean searches on keywords, fuzzy matching, keyword boosting, …)
  • no field (attribute) or document (object) boosting during the indexing phase at this time, using only the standard heuristics of lucene (like short fields with keyword hits increase the relevance ranking more than long fields with matching keywords)
  • sub-tree searches are implemented
  • no security yet
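The indexing model described in the list above (one field per attribute, searching over all fields, optional sub-tree restriction) can be mimicked with a toy in-memory index. This is a stand-in written in Python to illustrate the idea, not Lucene and not the plugin code; the class name, the sample objects and the `/1/2/58/` style path strings are assumptions made for the sketch.

```python
from collections import defaultdict


class ToyFieldIndex:
    """In-memory stand-in for a fielded full-text index (not Lucene itself)."""

    def __init__(self):
        # (field name, term) -> ids of the documents containing the term in that field
        self.postings = defaultdict(set)
        # document id -> node path string, e.g. "/1/2/58/" (assumed layout)
        self.paths = {}

    def add(self, doc_id, path, fields):
        """Index every attribute of an object as its own field."""
        self.paths[doc_id] = path
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, term, subtree=None):
        """Look the term up in all fields, optionally below one sub-tree."""
        term = term.lower()
        hits = set()
        for (_field, indexed_term), doc_ids in self.postings.items():
            if indexed_term == term:
                hits |= doc_ids
        if subtree is not None:
            hits = {d for d in hits if self.paths[d].startswith(subtree)}
        return sorted(hits)


index = ToyFieldIndex()
index.add(1, "/1/2/58/", {"title": "Lucene search", "body": "fast full text search"})
index.add(2, "/1/2/60/", {"title": "Summer conference", "body": "pictures and lucene talk"})
index.add(3, "/1/5/70/", {"title": "Users", "body": "lucene fans"})
```

Here `index.search("lucene")` matches all three objects regardless of which field holds the word, while `index.search("lucene", subtree="/1/2/")` keeps only the two objects below that node. A real index adds analysis, ranking and the query syntax mentioned above.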

The next phase will consist of

  • experimenting with boosting factors during indexing (for example use the number of reverse object relations to determine a boost factor at the object level, keyword attributes are more important than the rest, using configured boost factors from an ini file, …)
  • implementing the advanced ez search interface (class/attribute filtering, range queries including dates)
  • implementing a normal query interface (mainly for template programmers who want to include dedicated search results on certain pages/node views)
  • implementing security: this will be done in Java by writing a dedicated lucene filter which interrogates the database like the ez publish search does
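As an illustration of the first item, here is one way configured and computed boost factors could be combined. Nothing here is decided yet: the ini section and key names, the helper functions and the logarithmic formula are all assumptions chosen for the sketch (Python is used for brevity), not the planned implementation.

```python
import configparser
import math

# Hypothetical ini content; section and key names are invented for this sketch.
BOOST_INI = """
[ClassBoost]
article = 1.5
folder = 0.8

[AttributeBoost]
keywords = 2.0
"""


def load_boosts(ini_text):
    """Parse boost settings from ini-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser


def object_boost(parser, class_identifier, reverse_relations):
    """Configured class factor, scaled up by how often the object is referenced.

    Assumed formula: class_factor * (1 + ln(1 + reverse_relations)), so the
    boost grows slowly and an unreferenced object keeps the plain class factor.
    """
    class_factor = parser.getfloat("ClassBoost", class_identifier, fallback=1.0)
    return class_factor * (1.0 + math.log1p(reverse_relations))


def attribute_boost(parser, attribute_identifier):
    """Configured per-attribute factor, defaulting to no boost."""
    return parser.getfloat("AttributeBoost", attribute_identifier, fallback=1.0)


boosts = load_boosts(BOOST_INI)
```

The logarithm keeps a heavily linked object from drowning out everything else; whether that damping is right is exactly the kind of thing the experiments should show.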

Stay tuned: the source code will be released around the eZ summer conference, where I'll give a talk on this and other developments done here 😉

Starting work on new search functionality for eZ publish based on Lucene May 11, 2006

Posted by Paul Borgermans in eZ publish, Lucene.

After exploring various options for improving the search functionality in eZ publish, I finally settled on Lucene … (the Java version), which will serve as a base platform to build upon. I won't give a detailed list of pros and cons of alternatives (like Xapian, Egothor) or why not use complete search engines like mnogosearch, htdig … because … I'm lazy. But here are my reasons for choosing Lucene.

From a management / high level point of view:

  • It is a mature open source project, backed by the Apache foundation
  • It has very powerful features for information retrieval (aka search)
  • Feature-wise it is a good to excellent match to eZ publish (more below)
  • Integration with eZ publish is feasible (maybe even easy) through a PHP-Java bridge
  • It is extensible for special ranking algorithms, filtering and to implement object level security

From a technical point of view

  • It has a beautiful, simple API
  • Can be used with both PHP4 and PHP5
  • The concepts match well with eZ publish:
    • Lucene "documents" <-> eZ publish objects
    • Lucene fields within documents <-> eZ publish object attributes
    • Lucene special fields <-> eZ publish object meta-data
  • It appears fast enough for typical eZ publish use
  • Possibility to index a wide variety of file types
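The concept mapping in the list above can be sketched as a small conversion step: one content object becomes one flat "document" whose fields are the object attributes plus a few meta-data fields. The input layout, the `attr_`/`meta_` prefixes and the function name are assumptions for illustration (in Python for brevity), not the eZ publish or Lucene API.

```python
# Meta-data keys copied into special fields (an assumed selection).
META_KEYS = ("owner", "published", "section", "class", "path")


def to_document(ez_object):
    """Flatten one content object into a field-name -> text mapping."""
    document = {}
    # Each object attribute becomes its own searchable field.
    for identifier, value in ez_object["attributes"].items():
        document["attr_" + identifier] = str(value)
    # Object meta-data goes into separate, prefixed fields.
    for key in META_KEYS:
        if key in ez_object:
            document["meta_" + key] = str(ez_object[key])
    return document


sample_object = {
    "class": "article",
    "owner": "paul",
    "section": 1,
    "path": "/1/2/58/",
    "attributes": {"title": "Lucene and eZ publish", "body": "Notes on search"},
}
document = to_document(sample_object)
```

Keeping attributes and meta-data in separate fields is what later makes class/attribute filtering and per-field boosting possible.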

The "drawback" is maybe that additional software needs to be installed (Java, PHP-Java bridge), which means you will need almost full control over the servers, and that may not be feasible with some hosting companies.

In the next weeks, Kristof and I will be coding and prototyping … which will allow us to come up with a schedule and a more detailed feature implementation plan.