Regex for Chinese symbols in rules

evgeni.g
New user
Posts: 3
Joined: 2012-11-01 14:21

Regex for Chinese symbols in rules

Post by evgeni.g » 2012-11-01 14:35

Hi,

I'm trying to block some spam for a domain. The spam is in Chinese: the From, Subject and Body all contain Chinese text.
I've found a regex that should match any Chinese characters, but it is not working in hMailServer.
The pattern is [\u4E00-\u9FFF]+ and it works in the regex test tool I usually use.
I'm using hMailServer 5.3.3, build 1879, on a Windows Server 2008 R2 machine.
Could you please give me a hint as to why it is not working, or suggest another way to achieve this blocking rule?

Thanks in advance for any help.

dzekas
Senior user
Posts: 2486
Joined: 2005-10-13 21:28
Location: Lithuania

Re: Regex for Chinese symbols in rules

Post by dzekas » 2012-11-01 20:38

Emails are not normally sent in the UTF-16 or UCS-2 charsets; these are not common in the email world. Are you sure that the regex library used by hMailServer is Unicode-aware, and that the software converts all MIME-encoded information to plain text in the proper charset before feeding it to the regex?

evgeni.g
New user
Posts: 3
Joined: 2012-11-01 14:21

Re: Regex for Chinese symbols in rules

Post by evgeni.g » 2012-11-02 10:19

Hi,

I'm not sure if hMailServer's regex library is Unicode-aware. I just assumed that since v5.x supports Unicode, that holds for all of its functionality.
I'm also not sure if the MIME information is converted.
If I knew, I probably would not have posted my question, since I would have known why it is not working.

Could you advise on another way to block the Chinese spam, other than the built-in regex engine?

dzekas
Senior user
Posts: 2486
Joined: 2005-10-13 21:28
Location: Lithuania

Re: Regex for Chinese symbols in rules

Post by dzekas » 2012-11-02 19:24

evgeni.g wrote:Hi,

I'm not sure if hMailServer's regex library is Unicode-aware. I just assumed that since v5.x supports Unicode, that holds for all of its functionality.
I'm also not sure if the MIME information is converted.
If I knew, I probably would not have posted my question, since I would have known why it is not working.

Could you advise on another way to block the Chinese spam, other than the built-in regex engine?

See http://countries.nerd.dk/more.html. The Chinese country codes are CN and TW.

If the emails are written in Chinese, they might use one of the common Chinese character sets, such as EUC-CN, Big5, GB2312 or GB18030, unless the spammers start using UTF-8.
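
The charset idea can be illustrated outside hMailServer with a minimal Python sketch. It only looks at the charset declared in RFC 2047 encoded-words in a header; the function names and the gb2312 sample header are made up for illustration, and the charset list is just the one named above.

```python
from email.header import decode_header

# Charsets named in the post above; this is not a complete list.
CHINESE_CHARSETS = {"euc-cn", "big5", "gb2312", "gb18030"}

def declared_charsets(raw_header: str) -> set:
    """Collect the charsets declared in the encoded-words of a raw header."""
    charsets = set()
    for _, charset in decode_header(raw_header):
        if charset:  # None for parts that are not encoded-words
            charsets.add(charset.lower())
    return charsets

def uses_chinese_charset(raw_header: str) -> bool:
    """True if any encoded-word declares one of the charsets listed above."""
    return bool(declared_charsets(raw_header) & CHINESE_CHARSETS)

# Hypothetical header encoded in gb2312, for illustration only.
print(uses_chinese_charset("=?gb2312?B?1tDOxA==?="))
# A UTF-8 encoded header is not caught by a charset check.
print(uses_chinese_charset("=?utf-8?B?5Y2X5Y2a572R?= <nbw@caexpo.com>"))
```

A check like this misses Chinese text sent as UTF-8, which is the caveat raised above.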

SpamAssassin has language detection modules.

evgeni.g
New user
Posts: 3
Joined: 2012-11-01 14:21

Re: Regex for Chinese symbols in rules

Post by evgeni.g » 2012-11-07 11:23

Hi,

Blocking by country isn't really an option, as it would block legitimate emails.
I only need to block emails containing Chinese characters. The legitimate senders know they are communicating with a UK company and avoid using those characters.

I cannot install SpamAssassin on the server, I don't have access to another server, and even if I did, I could not open a port in the external firewall.

As for the encoding, it seems to be Unicode (UTF-8). Here is part of a header as an example:

From: =?utf-8?B?5Y2X5Y2a572R?= <nbw@caexpo.com>

Subject: =?utf-8?B?5Lit5Zu94oCU5Lic55uf5q+P5pel57uP6LS45b+r6K6v56ysMTI2M+acnw==?=
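
A minimal Python sketch (run outside hMailServer, so it says nothing about how hMailServer itself processes headers) shows why the pattern from the first post finds nothing in a header like the From line above until the encoded-word is decoded. The helper name decode_mime_header is made up for illustration.

```python
import re
from email.header import decode_header, make_header

# Pattern from the first post: characters in the range U+4E00 to U+9FFF.
CJK = re.compile(r"[\u4E00-\u9FFF]+")

def decode_mime_header(raw: str) -> str:
    """Turn RFC 2047 encoded-words (=?charset?B?...?=) into a Unicode string."""
    return str(make_header(decode_header(raw)))

raw_from = "=?utf-8?B?5Y2X5Y2a572R?= <nbw@caexpo.com>"
decoded_from = decode_mime_header(raw_from)

# The raw header is pure ASCII, so the pattern cannot match it.
print(CJK.search(raw_from))
# After decoding, the display name is Chinese text and the pattern matches.
print(CJK.findall(decoded_from))
```

If a filter sees only the raw form, a Unicode range pattern has nothing to match, whichever regex library is used.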


I'll need to dig through the logs to see whether that is how it originally looked; this is the header of an already received message.
I should add that all the messages I have seen use different sender domains, so a domain blacklist wouldn't help.

Any other suggestions?

dzekas
Senior user
Posts: 2486
Joined: 2005-10-13 21:28
Location: Lithuania

Re: Regex for Chinese symbols in rules

Post by dzekas » 2012-11-07 22:04

evgeni.g wrote:I only need to block emails with Chinese symbols
...
Any other suggestions?

There is an old sci-fi short story about the nine billion names of God. In reality the monks would have failed, because the machine wrote those names in only one character set. If the machine had had to calculate the names in all charsets, the Mark V would not have managed it in ages.

Could you set up a test filtering rule that tests the subject or sender for =?text?[BQqb]?text?= and check whether the rule catches 8-bit test emails? If it does, and 7-bit emails bypass the rule, then you will know that the email server is not decoding headers before filtering.
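
The =?text?[BQqb]?text?= shape can be written as a concrete regular expression. The following is a rough Python approximation for trying it out offline; whether hMailServer's rule engine accepts exactly this syntax is an assumption that would need checking, and the name looks_undecoded is made up for illustration.

```python
import re

# Approximation of an RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=")

def looks_undecoded(header_value: str) -> bool:
    """True if the value still contains a raw encoded-word."""
    return ENCODED_WORD.search(header_value) is not None

raw_subject = "=?utf-8?B?5Lit5Zu94oCU5Lic55uf5q+P5pel57uP6LS45b+r6K6v56ysMTI2M+acnw==?="

print(looks_undecoded(raw_subject))          # raw, still encoded
print(looks_undecoded("Plain ASCII subject"))  # nothing to decode
```

If a rule built on this pattern fires on incoming mail, the rule is being applied to the raw, undecoded header.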

If you insist on not installing tools that deal with spam better than email sorting rules can, you are free to pay for a commercial spam filtering service.
