Starting work on new search functionality for eZ publish based on Lucene
May 11, 2006
Posted by Paul Borgermans in Lucene, eZ publish.
After exploring various options for improving the search functionality in eZ publish, I finally settled on Lucene … (the Java version), which will serve as a base platform to build upon. I won't give a detailed list of pros and cons of the alternatives (like Xapian, Egothor), or explain why I'm not using complete search engines like mnogosearch or htdig … because … I'm lazy. But here are my reasons for choosing Lucene.
From a management / high level point of view:
- It is a mature open source project, backed by the Apache Software Foundation
- It has very powerful features for information retrieval (aka search)
- Feature-wise it is a good to excellent match to eZ publish (more below)
- Integration with eZ publish is feasible (maybe even easy) through a PHP-Java bridge
- It is extensible: special ranking algorithms, filtering and object-level security can be implemented on top of it
From a technical point of view:
- It has a beautiful, simple API
- It can be used with both PHP4 and PHP5
- The concepts match well with eZ publish:
  - Lucene "documents" <-> eZ publish objects
  - Lucene fields within documents <-> eZ publish object attributes
  - Lucene special fields <-> eZ publish object meta-data
- It appears fast enough for typical eZ publish use
- It offers the possibility to index a wide variety of file types
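A rough sketch of the concept mapping listed above, written in Python for brevity rather than in Java or PHP. Everything named here is an assumption made for illustration: `ContentObject`, `to_search_document` and the `meta_` / `attr_` field prefixes are hypothetical and are not part of the eZ publish or Lucene APIs. The plain dictionary only stands in for a Lucene document.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ContentObject:
    """Hypothetical stand-in for an eZ publish content object."""
    object_id: int
    class_identifier: str
    section_id: int
    # attribute identifier -> plain text value of that attribute
    attributes: Dict[str, str] = field(default_factory=dict)


def to_search_document(obj: ContentObject) -> Dict[str, str]:
    """Flatten one content object into a field-name -> text mapping.

    The returned dict plays the role of a Lucene document: one entry
    per field.
    """
    # Object meta-data goes into "special" fields (assumed meta_ prefix).
    doc = {
        "meta_object_id": str(obj.object_id),
        "meta_class": obj.class_identifier,
        "meta_section_id": str(obj.section_id),
    }
    # Each object attribute becomes one ordinary field (assumed attr_ prefix).
    for identifier, text in obj.attributes.items():
        doc["attr_" + identifier] = text
    return doc


if __name__ == "__main__":
    article = ContentObject(
        object_id=42,
        class_identifier="article",
        section_id=1,
        attributes={"title": "Search with Lucene", "body": "Some text"},
    )
    for name, value in to_search_document(article).items():
        print(name, "=", value)
```

Keeping meta-data in separately prefixed fields is one possible design choice: it would let a query filter on, say, the section or class of an object without touching the attribute text.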
The "drawback" is maybe that additional software needs to be installed (Java, PHP-Java bridge), which means you will need almost full control over the servers. That may not be feasible with some hosting companies.
In the coming weeks, Kristof and I will be coding and prototyping … which will allow us to come up with a schedule and a more detailed feature implementation plan.
Hi Paul
Looks as if there is a similar (non-free) solution already:
http://ez.no/products/solutions_and_software/mussen_multi_format_smart_search_engine
I don’t seem to recall seeing a demo though.
Looking forward to seeing the results
Cheers
Bruce
Hi Bruce
I’ve seen demos in the past and queried about the current status (availability, development) … but no replies (yet).
Anyway, ours will be GPL or similar, and we want to implement our own things to leverage the power of Lucene in eZ publish.
Very nice. A Lucene extension has been on my todo list for a good few years now.
I agree, a good search engine that can be used for eZ publish but also for related sites will be very handy.
Let me know if you need any help and I will see what we can do. We will try to coincide with a project if you have timescales.
hi tony, hi paul,
we also implemented lucene as a search engine for ez publish in the last week. our problem was that we had to index content from ez publish and also from other tables (econtent - our content model in ez for storing mass data).
we couldn't use the mussen engine because we also had to index our own econtent data inside the ez database.
lucene is a wonderful search engine: it is fast and it can index field-based data! and we have one engine for ez publish and other data.
we also tried to use the php implementation from zend framework. unfortunately it has some restrictions concerning the queries supported by this version. so we use the java version.
next week we will try to index 600,000 documents / products for a new project.
Frank
Hi Frank,
Our first goal is to make a search plugin based on Lucene. Did you also take that route? Would you share your work? Ours will eventually be freely available :-))
Paul
Hi Paul,
I found LIUS
http://www.bibl.ulaval.ca/lius/index.en.html
which is based on lucene and can:
[...]The LIUS framework adds to Lucene many files format indexing fonctionalities as: Ms Word, Ms Excel, Ms PowerPoint, RTF, PDF, XML, HTML, TXT, Open Office suite and JavaBeans[...]
Greetings, ekke
Hi Ekke
I know about LIUS, but the file formats you mentioned are already supported within eZ publish … except for PowerPoint and OOo … JavaBeans don’t count with eZ publish.
Those who use the Lucene plugin later will have a PowerPoint and OOo indexer “for free”, since we can leverage the OIP library (Java) from PHP by using the same PHP-Java bridge.
I will present this as part of my talk at the summer conference.
Thanks Paul,
aah, I now understand your approach. I had a different way in mind. I’m looking forward to this talk, see you in Skien
greetings, ekke
If you have your own app based on Lucene and want to index MS Word, Excel, PowerPoint and PDF files, take a look at how it is done in Lire or Nutch. The book “Lucene in Action” is also a very good resource.
In eZ publish, this support is provided outside the Lucene extension, by dedicated scripts that extract the text.
hth
Paul