Looking for a way to train SpamAssassin to catch spam better? Learn the tricks of the trade from the Support Engineers at Bobcares.
Unless you train your SpamAssassin, you will not be able to use all the features it offers to your advantage. Fortunately, the experts at Bobcares are here to help you out.
Training SpamAssassin to catch spam better
Although SpamAssassin comes with several plugins enabled for SPF, DKIM, RBL, and content checks, it is still limited unless you train the Bayesian filter. This filter compares each incoming message against content learned from known ham and spam emails to estimate whether the message is spam.
Training SpamAssassin with data you have collected is more effective when you have a large amount of ham and spam available. However, you can also use online databases to seed SpamAssassin’s database initially. After this initial training, you will have to continue training with your own data to make the filter more accurate.
Train SpamAssassin to catch spam better with sa-learn
sa-learn is the tool used to train SpamAssassin. It takes a directory of ham or spam emails and adds the corresponding tokens to the database. A token is a short sequence of characters or a word that is often found in ham or spam.
Note that sa-learn must be run as the same user that runs spamc inside the mail content filter. Otherwise, the training ends up in a different user’s database.
While you can run sa-learn manually, you can also add it to a cron job and make it part of the database’s routine update. In addition, sa-learn ignores emails it has already processed, so repeated runs do not add extra weight to certain tokens.
For instance, here is an example of using sa-learn with Maildir format:
$ sa-learn --spam /path/to/spam/folder
$ sa-learn --ham /path/to/ham/folder
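You can also teach SpamAssassin from a single email or from a mailbox file in mbox or mbx format. The --mbox and --mbx options only describe the file format, so they are combined with --spam or --ham. In the example below, the mbox file is assumed to contain only ham and the mbx file only spam:
$ sa-learn --spam /path/to/spam.email
$ sa-learn --ham --mbox /var/mail/user
$ sa-learn --spam --mbx /var/mbx/mail/test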
SpamAssassin also handles wildcards inside the path. This comes in handy when updating the Bayes database for several users. Curly braces can be used to list several folder names in one path.
$ sa-learn --spam /var/vmail/*/Maildir/Spam/{cur,new}
$ sa-learn --ham /var/vmail/*/Maildir/cur
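Before learning from many mailboxes at once, it can help to preview how many messages each spam folder would contribute. The following is only a sketch and not part of SpamAssassin: the function name count_maildir_messages is hypothetical, and it assumes the same <base>/<user>/Maildir layout as the paths above.

```shell
#!/bin/sh
# Hypothetical helper: print "<count> <folder>" for every Spam/cur and
# Spam/new folder below a base directory laid out as <base>/<user>/Maildir.
count_maildir_messages() {
    base="$1"
    for dir in "$base"/*/Maildir/Spam/cur "$base"/*/Maildir/Spam/new; do
        # Skip patterns that matched nothing and anything that is not a directory.
        [ -d "$dir" ] || continue
        # tr strips the padding that some wc implementations add.
        count=$(find "$dir" -type f | wc -l | tr -d ' ')
        echo "$count $dir"
    done
}
```

Calling count_maildir_messages /var/vmail would then list the folders that the wildcard command above is about to read.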
Initial Training data
Initially, you will not have a large database of spam and ham emails to train your SpamAssassin. However, you can access a few online databases or use SpamAssassin backups to get started.
Public spam data can help with initial training, but it will not always match the mail your own server receives. Our Support Engineers recommend training your SpamAssassin with your own incoming emails for better results. Here are a few of the commonly used public spam archives:
- ArtInvoice.hu Spam archive: It offers an initial database that can be restored.
- Untroubled.org Spam archive: This archive is still active and contains spam collected since 1998.
- Old SpamAssassin data: This consists of old public corpus data dating from 2002 to 2005 from SpamAssassin.
View trained data
Run the following command to view trained ham and spam data:
sa-learn --dump [all|data|magic]
The tokens themselves are not readable because they are stored as hashes. With the magic option, the output shows the number of ham and spam emails added to the database, the number of tokens, the token expiry details, and when the journal was last synced.
$ sa-learn --dump magic
0.000 0 4 0 non-token data: bayes db version
0.000 0 1441379 0 non-token data: nspam
0.000 0 516839 0 non-token data: nham
0.000 0 166735 0 non-token data: ntokens
0.000 0 1584031937 0 non-token data: oldest atime
0.000 0 1601720754 0 non-token data: newest atime
0.000 0 1601720758 0 non-token data: last journal sync atime
0.000 0 1601720761 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
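The counters in this output can be condensed into a quick health check. The sketch below is an illustration, not a SpamAssassin feature: the function name bayes_summary and the MIN_MSGS variable are hypothetical, and the value 200 assumes that the default bayes_min_spam_num and bayes_min_ham_num settings have not been changed in your configuration.

```shell
#!/bin/sh
# Hypothetical helper: read `sa-learn --dump magic` output on stdin and report
# whether enough mail has been learned for the Bayes rules to be used.
# 200 is assumed to match the default bayes_min_spam_num and bayes_min_ham_num;
# set MIN_MSGS to override it if your configuration differs.
MIN_MSGS="${MIN_MSGS:-200}"

bayes_summary() {
    awk -v min="$MIN_MSGS" '
        # The third field holds the value, the last field names the counter.
        $NF == "nspam"   { nspam = $3 }
        $NF == "nham"    { nham = $3 }
        $NF == "ntokens" { ntokens = $3 }
        END {
            printf "spam=%d ham=%d tokens=%d\n", nspam, nham, ntokens
            if (nspam >= min && nham >= min)
                print "Bayes has enough training data"
            else
                print "Bayes needs more training data"
        }'
}
```

It is meant to be used as sa-learn --dump magic | bayes_summary, run as the same user that owns the database.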
In addition, you can view the encoded token data with the output formatted into five fields. The fields from left to right include:
- Probability that the token is spam.
- Number of spam emails containing the token.
- Number of ham emails containing the token.
- The time when the token was last accessed during training, as a Unix timestamp.
- The token’s encoded version.
$ sa-learn --dump data
0.995 3 0 1575289038 008a1fb253
0.008 0 2 1607626751 03b82a68a9
0.993 2 0 1576254819 076aeef7fa
0.999 12 0 1574718455 0919d38b9c
0.987 1 0 1575615165 09cdb989a8
0.987 1 0 1574931501 0bedcc3ea2
0.016 0 1 1575120730 0e9e73e4fb
0.987 1 0 1576312660 10ae0462e8
1.000 277 0 1576750116 10f08309d0
0.008 0 2 1575368277 11e21177a3
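Since the five fields are plain whitespace-separated columns, standard text tools can filter them. As a rough sketch, the hypothetical function strong_spam_tokens below keeps only tokens at or above a chosen spam probability and orders them by how many spam emails contained them. The function name and the 0.99 default threshold are illustrative choices, not SpamAssassin settings.

```shell
#!/bin/sh
# Hypothetical helper: read `sa-learn --dump data` output on stdin and print
# "<spam count> <probability> <hashed token>" for tokens whose spam
# probability is at least the given threshold (default 0.99),
# ordered by spam count, highest first.
strong_spam_tokens() {
    threshold="${1:-0.99}"
    # NF == 5 skips the longer "non-token data" lines printed by --dump all.
    awk -v t="$threshold" 'NF == 5 && $1 + 0 >= t + 0 { print $2, $1, $5 }' |
        sort -k1,1nr
}
```

For example, sa-learn --dump data | strong_spam_tokens 0.995 would narrow the list further. The tokens stay hashed, so the result is mainly useful for judging how concentrated the training data is.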
Trained data storage
Depending on how your SpamAssassin is set up, the data can be stored in DBM, Berkeley DB, PostgreSQL, MySQL, Redis, or SDBM.
By default, trained data is stored in the DBM database files bayes_seen, bayes_journal, and bayes_toks. You can find these files in the .spamassassin folder in the user’s home directory. These files are managed by the sa-learn tool.
Backup & restore database
You can perform backup and restoration via the sa-learn command as seen below:
$ sa-learn --backup > /var/backup/spamassassin.bak
$ sa-learn --restore /var/backup/spamassassin.bak
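For regular backups, a small wrapper can add a date to the file name and discard a backup that came out empty. This is a sketch under stated assumptions: the function name backup_bayes, the spamassassin-YYYYMMDD.bak naming scheme, and the SA_LEARN override variable are all hypothetical conveniences rather than anything sa-learn provides.

```shell
#!/bin/sh
# Hypothetical helper: write a date-stamped Bayes backup into a directory and
# refuse to keep an empty file. SA_LEARN is an assumed override hook so the
# command can be swapped out, for example when testing.
SA_LEARN="${SA_LEARN:-sa-learn}"

backup_bayes() {
    dest_dir="$1"
    target="$dest_dir/spamassassin-$(date +%Y%m%d).bak"
    mkdir -p "$dest_dir" || return 1
    if "$SA_LEARN" --backup > "$target" && [ -s "$target" ]; then
        # Print the path so callers can log it or copy the file elsewhere.
        echo "$target"
    else
        # Remove the partial or empty file so it is not mistaken for a backup.
        rm -f "$target"
        return 1
    fi
}
```

A call such as backup_bayes /var/backup prints the path of the new file on success and returns a non-zero status otherwise, which makes it easy to use from a cron script.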
The backup will consist of seen emails and learned tokens in a single backed-up file.
Cron job & scripting
You can improve the performance of a script by using the --sync and --no-sync options with sa-learn. If you run several sa-learn commands, writing the changes to the journal only and syncing the journal with the database once at the end offers better performance. For instance:
#!/usr/bin/env bash
# Learn from every mailbox without syncing after each run
for folder in /var/mail/*; do
    sa-learn --no-sync --spam "${folder}/spam"
    sa-learn --no-sync --ham "${folder}/ham"
done
# Sync the journal with the database once at the end
sa-learn --sync
In case spam still seems to be getting through, verify that the same user is running spamc and sa-learn.
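One indirect way to check this is to look at who owns the Bayes database files. The sketch below assumes the default DBM files in the .spamassassin folder described earlier; the function name foreign_bayes_files is hypothetical, and the check says nothing about SQL or Redis storage.

```shell
#!/bin/sh
# Hypothetical helper: list Bayes database files in a directory that are NOT
# owned by the user running this function. Empty output means every bayes_*
# file found belongs to the current user.
foreign_bayes_files() {
    db_dir="${1:-$HOME/.spamassassin}"
    # id -u prints the numeric user ID, which find accepts for -user.
    find "$db_dir" -name 'bayes_*' ! -user "$(id -u)" -print
}
```

Run it as the user that starts spamc. Any file it prints belongs to someone else, which suggests that sa-learn was run under a different account.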
Conclusion
Thus we saw how to train SpamAssassin to catch spam better from the experts at Bobcares. Our Support Engineers are well-versed with the different aspects of Server Management and more.