
Starting work on new search functionality for eZ publish based on Lucene May 11, 2006

Posted by Paul Borgermans in eZ publish, Lucene.

After exploring various options for improving the search functionality in eZ publish, I finally settled on Lucene … (the Java version), which will serve as a base platform to build upon. I won't give a detailed list of pros and cons of the alternatives (like Xapian, Egothor) or explain why I'm not using complete search engines like mnogosearch, htdig … because … I'm lazy. But here are my reasons for choosing Lucene.

From a management / high level point of view:

  • It is a mature open source project, backed by the Apache foundation
  • It has very powerful features for information retrieval (aka search)
  • Feature-wise it is a good to excellent match to eZ publish (more below)
  • Integration with eZ publish is feasible (maybe even easy) through a PHP-Java bridge
  • It can be extended with special ranking algorithms, filtering, and object-level security

From a technical point of view:

  • It has a beautiful, simple API
  • Can be used with both PHP4 and PHP5
  • The concepts match well with eZ publish:
    • Lucene "documents" <-> eZ publish objects
    • Lucene fields within documents <-> eZ publish object attributes
    • Lucene special fields <-> eZ publish object meta-data
  • It appears fast enough for typical eZ publish use
  • Possibility to index a wide variety of file types

The "drawback" is maybe that additional software needs to be installed (Java, PHP-Java bridge). This means you will need almost full control over the servers, which may not be feasible with some hosting companies.

In the next weeks, Kristof and I will be coding and prototyping … which will allow us to come up with a schedule and a more detailed feature implementation plan.


Comments»

1. Bruce Morrison - May 12, 2006

Hi Paul

Looks as if there is a similar (non free) solution already

http://ez.no/products/solutions_and_software/mussen_multi_format_smart_search_engine

I don’t seem to recall seeing a demo though.

Looking forward to seeing the results

Cheers
Bruce

2. Paul Borgermans - May 12, 2006

Hi Bruce

I’ve seen demos in the past and asked about the current status (availability, development) … but no replies (yet).

Anyway, ours will be GPL or similar and we want to implement our own things to leverage the power of Lucene in eZ publish

3. Paul Forsyth - May 12, 2006

Very nice. A lucene extension has been on my todo list for a good few years now.

4. Tony Wood - May 12, 2006

I agree, a good search engine that can be used for eZ publish but also for related sites will be very handy.
Let me know if you need any help and I will see what we can do. We will try to coincide with a project if you have timescales.

5. Frank Dege - May 12, 2006

hi tony, hi paul,

we also implemented lucene as a search engine for ez publish in the last week. we had the problem of indexing content from ez publish and also from other tables (econtent – our content model in ez for storing mass data).

we couldn't use the mussen engine because we also had to index our own econtent-data inside the ez database.

lucene is a wonderful search engine: fast and it can index field based data! and we have one engine for ez publish and other data.

we also tried to use the php-implementation from zend framework. unfortunately it has some restrictions concerning the queries supported by this version. so we use the java version.

next week we will try to index 600.000 documents /products for a new project.

Frank

6. Paul Borgermans - May 12, 2006

Hi Frank,

Our goal is first to make a search plugin based on lucene, did you also take that route? Would you share your work? Ours will finally be freely available :-))

Paul

7. Ekkehard Dörre - May 26, 2006

Hi Paul,

I found LIUS

http://www.bibl.ulaval.ca/lius/index.en.html

which is based on lucene and can:
[…]The LIUS framework adds to Lucene many files format indexing fonctionalities as: Ms Word, Ms Excel, Ms PowerPoint, RTF, PDF, XML, HTML, TXT, Open Office suite and JavaBeans[…]

Greetings, ekke

8. Paul Borgermans - May 26, 2006

Hi Ekke

I know about LIUS, but the file formats you mentioned are already supported within ez publish … except for powerpoint and OOo … javabeans don’t count with ez publish.

Those who use the lucene plugin later will have a powerpoint and OOo indexer “for free”, since we can make the OIP library (Java) available to php by using the same php-java bridge.

I will present this as part of my talk at the summer conference ;-)

9. Ekkehard Dörre - May 27, 2006

Thanks Paul,

aah, I now understand your approach, I had a different way in mind. I’m looking forward to this talk, see you in Skien

greetings, ekke

10. Paul Borgermans - July 18, 2006

If you have your own app based on Lucene and want to index MS Word, Excel, PowerPoint and PDF files, take a look at how it is done in Lire or Nutch. The book “Lucene in action” is also a very good resource.

In eZ publish, this support is provided outside the Lucene extension, with dedicated scripts extracting the text.

hth

Paul

