Full text search indexing

Use this forum if you want to suggest a new feature to hMailServer. Before posting, please search the forum to confirm that it has not already been suggested.
palinka
Senior user
Posts: 4455
Joined: 2017-09-12 17:57

Full text search indexing

Post by palinka » 2018-04-18 14:27

A while back I asked whether this was possible (indexing not only headers but also message body text) because searching large mailboxes is effectively impossible: clients time out before the search completes. I didn't think about it for a while, but I just ran across a plugin for Dovecot that does exactly this.

I think it's a great (really almost necessary) utility in this age where people are used to having 10-year-old, very large mailboxes on Gmail, etc. that return search results immediately. Thank you for considering.

Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Re: Full text search indexing

Post by Dravion » 2018-04-18 16:06

Do you have a link for the Dovecot plugin?

mattg
Moderator
Posts: 22435
Joined: 2007-06-14 05:12
Location: 'The Outback' Australia

Re: Full text search indexing

Post by mattg » 2018-04-18 16:15

RvdH's builds include the HTMLbody in an index search >> viewtopic.php?f=10&t=30193&start=60#p203420 #3
Just 'cause I link to a page and say little else doesn't mean I am not being nice.
https://www.hmailserver.com/documentation

palinka
Senior user
Posts: 4455
Joined: 2017-09-12 17:57

Re: Full text search indexing

Post by palinka » 2018-04-18 16:35

Here is the link for the dovecot plugin: https://wiki.dovecot.org/Plugins/FTS

Matt, thanks! I will check it out right now.

palinka
Senior user
Posts: 4455
Joined: 2017-09-12 17:57

Re: Full text search indexing

Post by palinka » 2018-04-19 00:52

Matt, I installed it and did some testing and I don't think this actually indexes the content of message bodies. I left a question on the thread you referenced.

palinka
Senior user
Posts: 4455
Joined: 2017-09-12 17:57

Re: Full text search indexing

Post by palinka » 2018-04-19 12:38

RvdH responded to my question and confirmed it has nothing to do with indexing. So my feature request stands. Thanks for considering it.

By the way, the person that alerted me to the dovecot plugin also told me his searches went from 5 minutes to 1 second after installing the plugin. This is an incredible improvement and, I think, worthy of consideration as a new feature.

Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Re: Full text search indexing

Post by Dravion » 2018-12-04 20:27

I checked that thing.
In short, it's a commercial add-on for Dovecot Pro (only available to paying customers), and it doesn't do the full-text search indexing on its own. It uses Apache Lucene, Solr and Solr Server in the background to build the index, which speeds up the IMAP search command (which in most cases uses "good old grep"), so no wonder it's that fast.

Lucene and Solr/Solr Server are free software, so generally speaking it could be done for hMailServer as well, without any fees or restrictions like in the Dovecot Pro case.

palinka
Senior user
Posts: 4455
Joined: 2017-09-12 17:57

Re: Full text search indexing

Post by palinka » 2018-12-05 14:29

Dravion wrote:
2018-12-04 20:27
I checked that thing.
In short, it's a commercial add-on for Dovecot Pro (only available to paying customers), and it doesn't do the full-text search indexing on its own. It uses Apache Lucene, Solr and Solr Server in the background to build the index, which speeds up the IMAP search command (which in most cases uses "good old grep"), so no wonder it's that fast.

Lucene and Solr/Solr Server are free software, so generally speaking it could be done for hMailServer as well, without any fees or restrictions like in the Dovecot Pro case.
Thank you for looking into it. It's just a tiny tiny bit outside my skill range. :lol:

So what (I think) I learned is that Dovecot always uses external software for indexing, unlike hMailServer, which has indexing of certain headers built in. hMailServer searches either the indexed headers alone or the indexed headers plus a text query over the EML files, depending on the search command. I wonder if there is an hMailServer event that could override this order.

For example, on message delivery, scan message body, strip out html, insert results into a database with a link to the message. Then, upon search, force searching the database instead of hmail data folder.

Unfortunately my scripting skills are limited to cut n paste. :(
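
A minimal sketch of the idea above (index each body once at delivery, then answer searches from the index instead of reading the data folder). It is plain Python rather than an hMailServer event script; the `BodyIndex` class and the file paths are hypothetical, and a real version would persist the index in a database.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\w+")


class BodyIndex:
    """Hypothetical in-memory index: token -> set of message file paths."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, path, body_text):
        """Index one message; meant to be called once, at delivery time."""
        for token in TOKEN_RE.findall(body_text.lower()):
            self._postings[token].add(path)

    def search(self, query):
        """Return the paths of messages containing every word of the query."""
        tokens = TOKEN_RE.findall(query.lower())
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = BodyIndex()
index.add("data/example.com/alice/{1}.eml", "Invoice for the March order")
index.add("data/example.com/alice/{2}.eml", "Lunch order on Friday?")
print(sorted(index.search("order march")))
```

The search only touches the index, so its cost depends on the number of matching messages rather than on the size of the mailbox on disk.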

Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Re: Full text search indexing

Post by Dravion » 2018-12-07 22:54

Hey there.

I was just reading the Lucene and Solr manuals to find out how it works. It all comes down to two methods of doing it. However, FTS is an expert topic, so I looked into the MySQL, MariaDB, MS SQL, PostgreSQL and even SQLite docs, and I found that all of those databases already support full-text search.

This means we can integrate FTS into hMailServer by just modifying our SQL scripts, and we don't need extra third-party tools like Lucene or Solr to get it done (the Dovecot plugin requires Lucene and Solr, but we don't) :)
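
As an illustration of built-in database full-text search, here is a small sketch using SQLite's FTS5 module through Python's `sqlite3`. It assumes the SQLite library bundled with Python was compiled with FTS5; the table name `message_fts` and the file names are made up for the example and are not part of the hMailServer schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: the bundled SQLite includes the FTS5 extension.
# "filename" is stored but not tokenized; "body" is the searchable text.
conn.execute(
    "CREATE VIRTUAL TABLE message_fts USING fts5(filename UNINDEXED, body)"
)
conn.executemany(
    "INSERT INTO message_fts (filename, body) VALUES (?, ?)",
    [
        ("{A1}.eml", "Quarterly report attached, please review"),
        ("{B2}.eml", "Dinner on Saturday?"),
    ],
)


def search_bodies(term):
    """Return file names of messages whose body matches the FTS5 query."""
    rows = conn.execute(
        "SELECT filename FROM message_fts WHERE message_fts MATCH ? ORDER BY rank",
        (term,),
    )
    return [row[0] for row in rows]


print(search_bodies("report"))
```

The other databases mentioned use their own syntax for the same job, so the query text would differ per backend.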

mattg
Moderator
Posts: 22435
Joined: 2007-06-14 05:12
Location: 'The Outback' Australia

Re: Full text search indexing

Post by mattg » 2018-12-07 23:30

Except that full mail messages aren't stored in the database
They are stored in the file store, under the data directory.

Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Re: Full text search indexing

Post by Dravion » 2018-12-08 00:52

mattg wrote:
2018-12-07 23:30
Except that full mail messages aren't stored in the database
They are stored in the file store, under the data directory.
That's not a problem at all.

SQL databases support both internal data and external file data.
The important part is to analyze the text in the EML file and build a SQL database index that can be accessed by SQL query.
The problem is that every database handles this type of non-standard SQL query differently, so we need to abstract it in a way that fits hMailServer's
DatabaseManager class. It could be done with a stored procedure and/or trigger combination.

The work is worth it, because searching the mail body in Thunderbird or a different IMAP client can take minutes,
which is unacceptable.
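
To make the "every database does it differently" point concrete, here is a rough sketch of one way to hide the dialect differences behind a single function. The table and column names (`message_text`, `filename`, `body`) are hypothetical, the placeholder styles depend on the driver in use, and each backend still needs its own full-text index (or FTS virtual table, for SQLite) created beforehand.

```python
# Hypothetical schema: message_text(filename, body). Each WHERE clause uses
# the full-text predicate of that database; placeholders are driver-dependent.
FTS_WHERE = {
    "mysql": "MATCH(body) AGAINST (%s IN NATURAL LANGUAGE MODE)",
    "postgresql": "to_tsvector('english', body) @@ plainto_tsquery('english', %s)",
    "mssql": "CONTAINS(body, ?)",
    "sqlite": "message_text MATCH ?",
}


def build_search_sql(dialect):
    """Return a parameterized body-search query for the given database."""
    try:
        where = FTS_WHERE[dialect]
    except KeyError:
        raise ValueError(f"unsupported database: {dialect}") from None
    return f"SELECT filename FROM message_text WHERE {where}"


for name in FTS_WHERE:
    print(name, "->", build_search_sql(name))
```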

Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Re: Full text search indexing

Post by Dravion » 2018-12-08 01:43

There is also another benefit for us if we get FTS working. The FTS index will contain all searchable terms and even whole searchable patterns. At that point the EML file can be transparently text-compressed and dynamically unpacked on message fetch via POP3 and IMAP, and it will still stay searchable.

Text can be compressed by a factor of 3:1 or, if you're lucky, up to 4:1.

Pros:
With FTS, text compression and single-instance attachment management, we can reduce disk usage significantly and achieve faster search speeds in Thunderbird, Outlook, Roundcube, etc., even with millions of messages in a user or shared inbox.

Cons:
Transparent EML file compression requires more CPU time.

SQL databases get a bit bigger, because a certain amount of EML message text has to be inserted to build up a full-text search index.

One-time preparation, EML decoding and rebuild
need some extra CPU time, but this can be a low-priority background task.
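
A small sketch of the transparent compression idea, using Python's `zlib` as a stand-in for whatever compressor would actually be chosen. The sample message is artificially repetitive, so the ratio it prints is not representative of real mail; the point is only that the stored bytes round-trip exactly.

```python
import zlib


def compress_eml(raw: bytes) -> bytes:
    """Compress a raw EML file before writing it to the data directory."""
    return zlib.compress(raw, level=6)


def decompress_eml(blob: bytes) -> bytes:
    """Restore the original bytes when a client fetches the message."""
    return zlib.decompress(blob)


# Synthetic sample; real messages will compress less well than this.
sample = (
    "Subject: status\r\n\r\n"
    + "The nightly backup finished without errors.\r\n" * 200
).encode("utf-8")

packed = compress_eml(sample)
assert decompress_eml(packed) == sample  # lossless round trip
print(f"{len(sample)} bytes -> {len(packed)} bytes")
```

Because searches would be answered from the index, the compressed files would only need unpacking on fetch, which is where the extra CPU time listed under the cons comes from.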

mattg
Moderator
Posts: 22435
Joined: 2007-06-14 05:12
Location: 'The Outback' Australia

Re: Full text search indexing

Post by mattg » 2018-12-08 05:05

Are we talking plain text only, or HTMLBody searches as well?

This script puts a large chunk of each message into the database
http://www.hmailserver.com/forum/viewto ... 20&t=13890

This script adds the first million characters from oMessage.body into the database, plus sender, subject, time and hMailServer filename. That seems like a good start toward both goals: text search and linked attachments.
We would need a way to process existing messages.

ALSO, we need to remove messages from the database when they are deleted. This delivery log script just keeps adding new detail.

Is the size of the message going to be a problem for indexing?
Is a million characters enough?
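
A rough sketch of the two housekeeping points raised here: capping the stored body at the million-character limit instead of erroring out, and removing the row when a message is deleted. It uses an in-memory SQLite table with hypothetical names (`message_text`, `store_body`, `forget_message`); it is not the delivery log script linked above.

```python
import sqlite3

MAX_BODY_CHARS = 1_000_000  # the limit discussed above; adjust as needed

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message_text (filename TEXT PRIMARY KEY, body TEXT NOT NULL)"
)


def store_body(filename, body):
    """Insert or update a message body, truncating it at the limit."""
    conn.execute(
        "INSERT OR REPLACE INTO message_text (filename, body) VALUES (?, ?)",
        (filename, body[:MAX_BODY_CHARS]),
    )


def forget_message(filename):
    """Remove the row when the message is deleted, so the table cannot grow forever."""
    conn.execute("DELETE FROM message_text WHERE filename = ?", (filename,))


store_body("{C3}.eml", "x" * (MAX_BODY_CHARS + 50))
stored_length = conn.execute("SELECT length(body) FROM message_text").fetchone()[0]
forget_message("{C3}.eml")
remaining = conn.execute("SELECT count(*) FROM message_text").fetchone()[0]
print(stored_length, remaining)
```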

Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Re: Full text search indexing

Post by Dravion » 2018-12-08 07:59

mattg wrote:
2018-12-08 05:05
Are we talking plain text only, or HTMLBody searches as well?
There is no difference, because HTML is plain text too.
Everything (plus binary attachments like images, PDFs, etc.) is encoded in MIME format and seen by mail programs as a
multipart message.

But it could be reasonable to identify and remove all non-message-specific text like <img>, <body> and <html> before
using it as input for the FTS index build. This does no harm to the original message, and it keeps
the full-text search index from being polluted with keywords like <img> and <body>, because no normal Thunderbird, Outlook or Roundcube
user would ever search for something like that. It should also slightly reduce the amount of data that has to be stored in the
SQL database to build the FTS index.
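
A minimal sketch of the tag-stripping step, using only Python's standard `html.parser`. The `TextExtractor` class is illustrative; it also skips the contents of `<script>` and `<style>` blocks (an assumption added here) and joins text fragments with spaces, which can split a word that is partly wrapped in a tag.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and drop tags plus script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_html('<html><body><p>Hello <b>world</b></p><img src="a.png"></body></html>'))
```

The original message is left untouched; only the text handed to the index is cleaned.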
mattg wrote: This script puts a large chunk of each message into the database
http://www.hmailserver.com/forum/viewto ... 20&t=13890

This script adds the first million characters from oMessage.body into the database, plus sender, subject, time and hMailServer filename. That seems like a good start toward both goals: text search and linked attachments.
We would need a way to process existing messages.

ALSO, we need to remove messages from the database when they are deleted. This delivery log script just keeps adding new detail.

Is the size of the message going to be a problem for indexing?
Is a million characters enough?
Regarding the script, I think this approach doesn't help, because you need to get rid of attachments and unwanted text/tags before you use it as input for an FTS index. It also works only for incoming emails, but in reality most users already have a huge data folder and countless DB records, and we need to take care of already existing data as well. I think a nice little program like the DataDirectorySynchronizer will do the job just fine. No user input is needed; the program can run independently in the background and calculate the FTS index on its own by scanning EML files. If nothing is left to do, it should exit by itself until a scheduled task restarts it again.
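
A rough sketch of such a background pass, written in Python for illustration only (it is not the DataDirectorySynchronizer and does not use the hMailServer API). It walks a data directory, extracts the text parts of each EML file while skipping attachments, and hands them to an indexing callback; `backfill`, `already_indexed` and `index_message` are hypothetical names.

```python
import os
import tempfile
from email import policy
from email.parser import BytesParser


def iter_eml_files(data_dir):
    """Yield every .eml file below the data directory."""
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            if name.lower().endswith(".eml"):
                yield os.path.join(root, name)


def extract_text(path):
    """Return the text parts of a message, ignoring attachments."""
    with open(path, "rb") as handle:
        message = BytesParser(policy=policy.default).parse(handle)
    parts = []
    for part in message.walk():
        if part.get_content_maintype() != "text" or part.is_attachment():
            continue
        parts.append(part.get_content())
    return "\n".join(parts)


def backfill(data_dir, already_indexed, index_message):
    """Index files not seen before; return how many were processed."""
    count = 0
    for path in iter_eml_files(data_dir):
        if path in already_indexed:
            continue
        index_message(path, extract_text(path))
        already_indexed.add(path)
        count += 1
    return count


# Demo on a throwaway directory with one tiny message.
with tempfile.TemporaryDirectory() as data_dir:
    sample = os.path.join(data_dir, "sample.eml")
    with open(sample, "wb") as handle:
        handle.write(b"Subject: hi\r\nContent-Type: text/plain\r\n\r\nbackfill me\r\n")
    seen, indexed = set(), {}
    first_run = backfill(data_dir, seen, indexed.__setitem__)
    second_run = backfill(data_dir, seen, indexed.__setitem__)
print(first_run, second_run)
```

A second run finds nothing new and returns immediately, which matches the "exit until the scheduled task restarts it" behaviour described above.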

palinka
Senior user
Posts: 4455
Joined: 2017-09-12 17:57

Re: Full text search indexing

Post by palinka » 2018-12-08 13:37

Dravion wrote:
2018-12-08 07:59
mattg wrote:
2018-12-08 05:05
Are we talking plain text only, or HTMLBody searches as well?
There is no difference, because HTML is plain text too.
Everything (plus binary attachments like images, PDFs, etc.) is encoded in MIME format and seen by mail programs as a
multipart message.

But it could be reasonable to identify and remove all non-message-specific text like <img>, <body> and <html> before
using it as input for the FTS index build. This does no harm to the original message, and it keeps
the full-text search index from being polluted with keywords like <img> and <body>, because no normal Thunderbird, Outlook or Roundcube
user would ever search for something like that. It should also slightly reduce the amount of data that has to be stored in the
SQL database to build the FTS index.
This exactly. ^^^

After stripping out html tags and attachments, most message bodies contain probably 1000 characters or less of "actual text" which is the thing that people read and search for. It's probably not even that impactful on db performance.

mattg
Moderator
Posts: 22435
Joined: 2007-06-14 05:12
Location: 'The Outback' Australia

Re: Full text search indexing

Post by mattg » 2018-12-09 01:14

Dravion wrote:
2018-12-08 07:59
Regarding the script, I think this approach doesn't help, because you need to get rid of attachments and unwanted text/tags before you use it as input for an FTS index.
That 'Database Delivery Log' definitely only adds the oMessage.body text to the database.

No attachments, no HTML tags.
And I personally hit the 1 million characters a few times when I used that script.
I had to modify the script slightly to stop the errors that were being caused when the million characters limit was hit.

(I automate the sending of logs from client servers. These logs are parsed by my hmailserver for various keywords, and are treated differently if keywords are found. For instance I have an errors@example.com address where all significant errors are sent from various machines)
mattg wrote:
2018-12-08 05:05
that seems like a good start to achieve both goals of text search and linked attachments.
Would need a way to process existing messages.

ALSO, need to remove messages from database when they are deleted. This delivery log script just keeps adding new detail.

mikernet
Normal user
Posts: 62
Joined: 2018-09-04 22:22

Re: Full text search indexing

Post by mikernet » 2020-07-09 15:33

We really need to get some reasonable full-text search capabilities into hMailServer. Everyone and their mom has instant FTS in their email now. Every mobile email client relies on the server to do full-text searches when you search email, which makes searching email on mobile impossible. This has become a big pain point for us right now.

The situation could be improved significantly with minimal changes if hMailServer streamed the search results as it got hits instead of waiting until the full search is done and then dumping the entire result at the end. The emails should be searched by date, newest first.

Thoughts?
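
To illustrate the streaming suggestion, here is a small sketch in which a Python generator yields each hit as soon as it is found, scanning newest first. It only demonstrates the ordering and the incremental delivery; the `stream_matches` function and the sample mailbox are made up, and whether IMAP clients would show partial results is a separate question not covered here.

```python
from datetime import datetime


def stream_matches(messages, term):
    """Yield (received, filename) for each hit, newest message first."""
    needle = term.lower()
    newest_first = sorted(messages, key=lambda m: m[0], reverse=True)
    for received, filename, body in newest_first:
        if needle in body.lower():
            # The caller receives this hit before older messages are scanned.
            yield received, filename


mailbox = [
    (datetime(2019, 1, 5), "{old}.eml", "Your invoice for January"),
    (datetime(2020, 7, 1), "{new}.eml", "Invoice reminder"),
    (datetime(2020, 3, 9), "{mid}.eml", "Holiday photos"),
]

hits = stream_matches(mailbox, "invoice")
print(next(hits)[1])  # first hit is available without scanning the whole mailbox
```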
