Author |
Message |
Cowboy
Guest
|
Posted: Sat Nov 22, 2003 6:49 am Post subject: Need help with regex filter. |
|
|
I need a filter that will weed out comments placed in the middle of a word.
It would delete:
Buy my delicious sp<!kfgkh8899>am.
But it would not delete:
Buy my delicious <!kfgkh8899>spam.
I cannot write the filter myself, so if someone could help me with this, it would do a lot for my spam filtering.
Thanks! |
|
|
|
stan_qaz
General
Premium Member
Joined: Mar 31, 2003
Posts: 4099
Location: USA
|
Posted: Sat Nov 22, 2003 1:09 pm Post subject: |
|
|
Go to the search function, select the Firetrust category, and search on html. You will find plenty of discussion of the topic and several suggestions for filters. |
|
|
|
Cowboy
Guest
|
Posted: Sat Nov 22, 2003 2:44 pm Post subject: |
|
|
There is no filter like I need. At least not that I can find.
The closest is the filter that counts the comments but I think that is not what I need. |
|
|
|
stan_qaz
General
Premium Member
Joined: Mar 31, 2003
Posts: 4099
Location: USA
|
Posted: Sat Nov 22, 2003 5:26 pm Post subject: |
|
|
That is as good as it is going to get; the problem isn't easy to solve, as you saw from the posts you looked at.
Are you willing to pay to have a filter written? Make an offer and see if someone is willing to tackle the problem for some cash.
If not, chip in on the threads asking for a processed-message option for the filters, which is the fix I like best. |
|
|
|
Cowboy
Guest
|
Posted: Sat Nov 22, 2003 7:59 pm Post subject: |
|
|
Nonsense. It can not be as good as it gets until someone tries to write the filter. Nobody has tried yet! |
|
|
|
denn988
Guest
|
Posted: Sat Nov 22, 2003 9:31 pm Post subject: |
|
|
Cowboy wrote: |
Nonsense. It can not be as good as it gets until someone tries to write the filter. Nobody has tried yet! |
So....Why don't you try???
(?# finds words broken by html comments )[a-z](<[!/].*?>)[a-z]
You might find it to be easier than you thought possible...
|
|
|
|
Ikeb
General
Premium Member
Joined: Apr 20, 2003
Posts: 3483
Location: Canada
|
Posted: Sun Nov 23, 2003 1:12 am Post subject: |
|
|
Cowboy wrote: |
Nonsense. It can not be as good as it gets until someone tries to write the filter. Nobody has tried yet! |
Ride 'em cowboy!
Shoot first, ask questions later!
_________________
I like SPAM ... on my sandwich!
|
|
|
|
Cowboy
Guest
|
Posted: Sun Nov 23, 2003 7:47 am Post subject: |
|
|
I have tried. I've read the help files. I've tried to put together the parameters to make such a filter. I've sat for hours trying everything I can think of. Never once did I get it to work. So I decided I needed help. All I got was a bunch of comments about as useful as my filter attempts.
In the best of worlds I would have gotten a "That's a good idea to filter out just html comments that are used to disguise words, instead of trying to count the comments. Here's your filter.", or I would get a "That's a bad idea because it's impossible to make such a filter." Or at least not a thread bogged down by the assume patrol. |
|
|
|
denn988
Guest
|
Posted: Sun Nov 23, 2003 8:29 am Post subject: |
|
|
As long as you have already tried....
See if this will help:
The body....
contains Regular Expr...
Quote: |
(?# words broken by html comments )[a-z]<[!/][^<]*?>[a-z] |
Anyone who wishes can make any improvements to the filter as they see fit. This is just the simplest version that would seem to work.
If you decide you have to AUTO-DELETE based on this, don't blame me if you lose a few legitimate mails.
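For anyone who wants to try the expression outside the mail filter, here is a minimal sketch using Python's `re` module as a stand-in for the filter's regex engine. That substitution is an assumption: the filter's own engine may differ, for example in case sensitivity. The name `is_trapped_v1` is made up for this illustration. The `(?# ... )` group is an inline comment that the engine ignores.

```python
import re

# The expression from the post above; (?# ... ) is an inline comment.
PATTERN_V1 = re.compile(
    r"(?# words broken by html comments )[a-z]<[!/][^<]*?>[a-z]"
)


def is_trapped_v1(text: str) -> bool:
    """True if a lowercase letter, a <!...> or </...> tag, and another
    lowercase letter appear back to back anywhere in the text."""
    return PATTERN_V1.search(text) is not None


# The two sentences from the original request.
print(is_trapped_v1("Buy my delicious sp<!kfgkh8899>am."))  # True
print(is_trapped_v1("Buy my delicious <!kfgkh8899>spam."))  # False
```

The second sentence is not trapped because the character before `<` is a space, not a letter.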
|
|
|
|
denn988
Guest
|
Posted: Sun Nov 23, 2003 8:11 pm Post subject: |
|
|
Cowboy,
I have had a day to see how the above filter works and it looks pretty good so far.
There are a couple of mods that I have made to it that have improved its trap rate.
Change the above RegExp to:
Quote: |
(?# words broken by html comments )[a-z]<[^<]*?>[a-z]
|
I removed the '!/' from the filter, so it will trap any word that has the html brackets in between the letters.
Examples:
s<!tytyt>pam
sp<wretser>am
sp</font>am
are trapped
this </font>is a test
is NOT trapped.
This filter can still result in false positives, so don't auto-delete.
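The examples above can be checked with a short sketch, again assuming Python's `re` module behaves like the filter's regex engine for this pattern. The helper name `is_trapped_v2` and the `<br>` line are illustrative additions, not part of the original filter; the `<br>` line shows one way an ordinary HTML mail could produce a false positive.

```python
import re

# The relaxed expression: any <...> between two lowercase letters.
PATTERN_V2 = re.compile(
    r"(?# words broken by html comments )[a-z]<[^<]*?>[a-z]"
)


def is_trapped_v2(text: str) -> bool:
    """True if any <...> tag sits directly between two lowercase letters."""
    return PATTERN_V2.search(text) is not None


examples = [
    "s<!tytyt>pam",           # trapped
    "sp<wretser>am",          # trapped
    "sp</font>am",            # trapped
    "this </font>is a test",  # not trapped: a space precedes the tag
    "line one<br>line two",   # hypothetical legitimate HTML, still trapped
]

for text in examples:
    print(f"{text!r}: {is_trapped_v2(text)}")
```

The last line is the kind of legitimate markup that makes auto-deleting on this filter risky.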
|
|
|
|
Ikeb
General
Premium Member
Joined: Apr 20, 2003
Posts: 3483
Location: Canada
|
Posted: Mon Nov 24, 2003 1:54 am Post subject: |
|
|
Denn988, thanks for another good one.
Do you think it's OK to have the filter fire on a single hit?
Also, instead of the [^<] negation, why not use [^>] since it's the closing ">" bracket that will follow this part of the match?
_________________
I like SPAM ... on my sandwich! |
|
|
|
denn988
Guest
|
Posted: Mon Nov 24, 2003 8:05 am Post subject: |
|
|
Ikeb wrote: |
Denn988, thanks for another good one.
Do you think it's OK to have the filter fire on a single hit?
Also, instead of the [^<] negation, why not use [^>] since it's the closing ">" bracket that will follow this part of the match? |
First...
I don't think it would be a good idea to write this type of filter to look for multiple hits. The reason is that it starts with a wildcard ([a-z]). If you wrote the filter to keep looking for more than one instance, each iteration would require a lot of CPU time, and with the 'a-z' at the beginning it would do that for each character in the message.
That would probably make the filter more time-intensive than you would consider acceptable.
Second...
As to the '[^<]' in the Regex...
It is there to prevent the filter from trapping a situation where there are two opening brackets prior to a closing bracket.
Example:
10<20<30
30>20>10
The above is NOT html, but represents two valid mathematical expressions.
You don't want the filter to trap on something like that.
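A small sketch of the difference between the two negated classes, assuming Python's `re` module as a stand-in for the filter's engine. The test line `price<20 <b>today` is a hypothetical example made up for this illustration: it has a stray `<` used as "less than", followed later by a real tag. Note that in this sketch the two numeric lines above are not matched by either variant, because the pattern also requires a letter on each side of the brackets.

```python
import re

# Only the negated character class differs between the two variants.
NOT_OPEN = re.compile(r"[a-z]<[^<]*?>[a-z]")   # [^<], as used in the filter
NOT_CLOSE = re.compile(r"[a-z]<[^>]*?>[a-z]")  # [^>], the suggested alternative


def first_hit(pattern: re.Pattern, text: str):
    """Return the matched text, or None when the pattern does not fire."""
    found = pattern.search(text)
    return found.group(0) if found else None


line = "price<20 <b>today"  # hypothetical line, not from the thread

# [^<] refuses to run past the second "<", so nothing matches.
print(first_hit(NOT_OPEN, line))   # None
# [^>] runs from the stray "<" to the tag's ">", so the line is trapped.
print(first_hit(NOT_CLOSE, line))  # e<20 <b>t
```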
Before you ask....
You could have another rule in the filter that looks for a "Content-Type: text/html"....but it would be something of a useless rule. There would be no easy way to write the filter so that it would only look at the message part that was HTML, in those cases that were multipart messages.
Anything that you would try to do with regex to try to do that would be even more CPU intensive than the 'multi-hit' filter would be.
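For comparison, this is roughly what "only look at the HTML part" involves when done outside the filter. The sketch below uses Python's standard `email` package to walk a multipart message and apply the pattern to `text/html` parts only. It is an illustration of the idea under that assumption, not something the mail filter itself offers; `html_parts_trapped` and `build_message` are names invented here.

```python
import re
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

PATTERN = re.compile(r"[a-z]<[^<]*?>[a-z]")


def html_parts_trapped(raw_message: str) -> bool:
    """Apply the pattern only to the text/html parts of a message."""
    message = message_from_string(raw_message)
    for part in message.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        if PATTERN.search(payload.decode(charset, errors="replace")):
            return True
    return False


def build_message(plain: str, html: str) -> str:
    """Build a small multipart/alternative message for the demo."""
    message = MIMEMultipart("alternative")
    message.attach(MIMEText(plain, "plain"))
    message.attach(MIMEText(html, "html"))
    return message.as_string()


clean = build_message("Compare x<y and y>z here.", "<p>Hello world</p>")
spam = build_message("Buy my delicious spam.",
                     "<p>Buy my delicious sp<!kfgkh8899>am.</p>")

print(PATTERN.search(clean) is not None)  # whole raw message: True
print(html_parts_trapped(clean))          # HTML part only: False
print(html_parts_trapped(spam))           # HTML part only: True
```

Run against the whole raw message, the pattern fires on the plain-text comparison in the first mail; restricted to the HTML part, it does not.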
|
|
|
|
Ikeb
General
Premium Member
Joined: Apr 20, 2003
Posts: 3483
Location: Canada
|
Posted: Mon Nov 24, 2003 9:45 am Post subject: |
|
|
denn988 wrote: |
I don't think it would be a good idea to write this type of filter to look for multiple hits. The reason is that it starts with a wildcard ([a-z]). If you wrote the filter to keep looking for more than one instance, each iteration would require a lot of CPU time, and with the 'a-z' at the beginning it would do that for each character in the message.
That would probably make the filter more time-intensive than you would consider acceptable.
Second...
As to the '[^<]' in the Regex...
It is there to prevent the filter from trapping a situation where there are two opening brackets prior to a closing bracket.
Example:
10<20<30
30>20>10
The above is NOT html, but represents two valid mathematical expressions.
You don't want the filter to trap on something like that. |
OK thanks for the clarification.
denn988 wrote: |
Before you ask....
You could have another rule in the filter that looks for a "Content-Type: text/html"....but it would be something of a useless rule. There would be no easy way to write the filter so that it would only look at the message part that was HTML, in those cases that were multipart messages.
Anything that you would try to do with regex to try to do that would be even more CPU intensive than the 'multi-hit' filter would be. |
You give me too much credit! I hadn't thought of attempting to check the html parts only. Besides I think the math expressions you gave as examples could also occur with HTML messages.
_________________
I like SPAM ... on my sandwich!
|
|
|
|
denn988
Guest
|
Posted: Mon Nov 24, 2003 10:13 am Post subject: |
|
|
Quote: |
You give me too much credit! I hadn't thought of attempting to check the html parts only. Besides I think the math expressions you gave as examples could also occur with HTML messages.
|
Those examples would look totally different if they appeared in an HTML part than they would in a Plain Text part.
Those examples, if sent as HTML, would appear in the raw text as:
10<20<30
and
30>20>10
The brackets must be substituted when converting them to the HTML raw text in order to keep the translator from being confused.
|
|
|
|
Guest
|
Posted: Mon Nov 24, 2003 10:22 am Post subject: |
|
|
Sorry,
I forgot to turn the HTML off when I posted.
Those examples, if sent as HTML, would appear in the raw text as:
1 0 & l t ; 2 0 & l t ; 3 0
and
3 0 & g t ; 2 0 & g t ; 1 0
I had to place spaces between each character above to get them to post.
The brackets must be substituted when converting them to the HTML raw text in order to keep the translator from being confused. |
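The substitution described above can be reproduced with Python's standard `html.escape` function. This is a small sketch of the idea, not how any particular mail client does it; the helper name `to_html_text` is made up for this example.

```python
import html
import re

PATTERN = re.compile(r"[a-z]<[^<]*?>[a-z]")


def to_html_text(plain: str) -> str:
    """Escape &, < and > the way they must appear in HTML source text."""
    return html.escape(plain, quote=False)


for expression in ("10<20<30", "30>20>10"):
    print(to_html_text(expression))
# Prints:
# 10&lt;20&lt;30
# 30&gt;20&gt;10

# Once escaped, literal brackets are gone, so the filter has nothing to trap.
print(PATTERN.search(to_html_text("x<y and y>z")) is None)  # True
```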
|
|
|
|