Bobcares

How to train SpamAssassin to catch spam better

Sep 26, 2021

Looking for a way to train SpamAssassin to catch spam better? Learn the tricks of the trade from the Support Engineers at Bobcares.

Unless you train your SpamAssassin, you will not be able to use all the features it offers to your advantage. Fortunately, the experts at Bobcares are here to help you out.

Training SpamAssassin to catch spam better

Although SpamAssassin comes with a couple of plugins enabled for SPF, DKIM, RBL, and content checks, it remains limited unless you train the Bayesian filter. This filter compares an incoming email against content learned from known ham and spam emails to determine whether the email is spam.

Training SpamAssassin with data you have collected is more effective when you have a large amount of ham and spam available. However, you can also use online databases to seed SpamAssassin’s database initially. After that initial training, continue training with your own data to make the filter more accurate.

Train SpamAssassin to catch spam better with sa-learn

sa-learn is handy in training SpamAssassin. It takes a directory of ham or spam emails and adds the corresponding tokens to the Bayes database. A token is a short sequence of characters or a word that often appears in ham or spam.

Note that sa-learn must run as the same user that the mail content filter uses to start spamc. Otherwise, the tokens end up in a different user’s database than the one consulted during filtering.

While you can run sa-learn manually, you can also add it to a cron job and make it part of the database’s routine update. sa-learn skips emails it has already processed, so running it again over the same folder does not give extra weight to their tokens.

For instance, here is an example of using sa-learn with Maildir format:

$ sa-learn --spam /path/to/spam/folder
$ sa-learn --ham /path/to/ham/folder

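You can also teach SpamAssassin from a single email or from mailboxes in mbox or mbx format. Each command still needs --spam or --ham to say how the messages should be classified. The examples below assume that /var/mail/user holds ham and /var/mbx/mail/test holds spam:

$ sa-learn --spam /path/to/spam.email
$ sa-learn --ham --mbox /var/mail/user
$ sa-learn --spam --mbx /var/mbx/mail/test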

sa-learn also accepts wildcards in the path. This comes in handy when updating the Bayes database for several users. Curly braces list alternative folder names, so {cur,new} covers both the cur and new subfolders.

$ sa-learn --spam /var/vmail/*/Maildir/Spam/{cur,new}
$ sa-learn --ham /var/vmail/*/Maildir/cur
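
Before running sa-learn across many mailboxes, it can help to preview which folders a pattern actually matches. The following is a minimal sketch, not part of SpamAssassin: the function name preview_folders is hypothetical, and the /var/vmail layout is simply the one used in the example above.

```shell
#!/usr/bin/env bash
# Print every argument that is an existing directory, one per line.
# Call it with the same pattern you plan to hand to sa-learn.
preview_folders() {
    local path
    for path in "$@"; do
        # A pattern that matches nothing is passed through literally,
        # so keep only paths that really exist as directories.
        [ -d "$path" ] && printf '%s\n' "$path"
    done
    return 0
}

# Usage (the pattern stays unquoted so the shell can expand it):
# preview_folders /var/vmail/*/Maildir/Spam/{cur,new}
```

If the list looks right, the same pattern can be passed to sa-learn --spam.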

Initial Training data

Initially, you will not have a large database of spam and ham emails to train your SpamAssassin. However, you can access a few online databases or use SpamAssassin backups to get started.

Public spam data helps you get started, but it may not match the mail your server actually receives. Our Support Engineers recommend training SpamAssassin on your own incoming emails for better results. Here are a few of the commonly used public spam archives:

  • ArtInvoice.hu Spam archive: It offers an initial database that can be restored.
  • Untrouble.org Spam archive: This archive is still active and has collected spam since 1998.
  • Old SpamAssassin data: This consists of old public corpus data dating from 2002 to 2005 from SpamAssassin.

View trained data

Run the following command to view trained ham and spam data:

sa-learn --dump [all|data|magic]

The tokens themselves are not readable because they are stored hashed. With the magic option, the output shows the number of spam and ham emails added to the database, the number of tokens, token expiry information, and when the journal was last synced.

$ sa-learn --dump magic
0.000        0            4        0  non-token data: bayes db version
0.000        0      1441379        0  non-token data: nspam
0.000        0       516839        0  non-token data: nham
0.000        0       166735        0  non-token data: ntokens
0.000        0   1584031937        0  non-token data: oldest atime
0.000        0   1601720754        0  non-token data: newest atime
0.000        0   1601720758        0  non-token data: last journal sync atime
0.000        0   1601720761        0  non-token data: last expiry atime
0.000        0            0        0  non-token data: last expire atime delta
0.000        0            0        0  non-token data: last expire reduction count
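
To use these counters in a script, the magic output can be filtered by label. This is a rough sketch under the assumption that the output keeps the column layout shown above; the function name magic_value is hypothetical.

```shell
#!/usr/bin/env bash
# Print the value of one "non-token data" field from `sa-learn --dump magic`.
# Reads the dump on stdin; $1 is the label, for example nspam or nham.
magic_value() {
    awk -v name="$1" '
        # The label is everything after "non-token data: ".
        { label = $0; sub(/^.*non-token data: /, "", label) }
        # The third column holds the value for that label.
        label == name { print $3 }
    '
}

# Usage:
# sa-learn --dump magic | magic_value nspam
# sa-learn --dump magic | magic_value nham
```

These two numbers are worth watching early on: by default, SpamAssassin does not apply Bayes scoring until at least 200 spam and 200 ham messages have been learned (the bayes_min_spam_num and bayes_min_ham_num settings).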

In addition, you can view the encoded token data with the output formatted into five fields. The fields from left to right include:

  • Probability that the token is spam.
  • Number of spam emails containing the token.
  • Number of ham emails containing the token.
  • Time when the token was last accessed, as a Unix timestamp.
  • The token’s hashed form.

$ sa-learn --dump data
0.995          3          0 1575289038  008a1fb253
0.008          0          2 1607626751  03b82a68a9
0.993          2          0 1576254819  076aeef7fa
0.999         12          0 1574718455  0919d38b9c
0.987          1          0 1575615165  09cdb989a8
0.987          1          0 1574931501  0bedcc3ea2
0.016          0          1 1575120730  0e9e73e4fb
0.987          1          0 1576312660  10ae0462e8
1.000        277          0 1576750116  10f08309d0
0.008          0          2 1575368277  11e21177a3
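
Because the token dump is plain columns, standard tools can summarise it. The sketch below lists the tokens with the strongest spam signal; the function name spammy_tokens and the choice of threshold are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Read `sa-learn --dump data` output on stdin and print the tokens whose
# spam probability is at least $1, as "spam-count probability token",
# sorted by spam count with the highest first.
spammy_tokens() {
    awk -v min="$1" '$1 + 0 >= min + 0 { print $2, $1, $5 }' | sort -rn
}

# Usage:
# sa-learn --dump data | spammy_tokens 0.99 | head
```

Applied to the sample output above with a threshold of 0.99, the first line would be the token 10f08309d0, which was seen in 277 spam emails.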

Trained data storage

Depending on your SpamAssassin setup, the data can be stored in DBM, Berkeley DB, PostgreSQL, MySQL, Redis, or SDBM.

By default, trained data is stored in the DBM database files bayes_seen, bayes_journal, and bayes_toks. You can find these files in the .spamassassin folder in the user’s home directory. The sa-learn tool manages these files.


Backup & restore database

You can perform backup and restoration via the sa-learn command as seen below:

$ sa-learn --backup > /var/backup/spamassassin.bak
$ sa-learn --restore /var/backup/spamassassin.bak

The backup will consist of seen emails and learned tokens in a single backed-up file.
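
For routine backups, a dated file name and a basic sanity check can be wrapped around the backup command. This is only a sketch: the function name backup_bayes and the file naming scheme are assumptions, not something sa-learn provides.

```shell
#!/usr/bin/env bash
# Write a dated Bayes backup into directory $1 and print its path.
# Returns non-zero if sa-learn fails or the backup file is empty.
backup_bayes() {
    local dir file
    dir=$1
    file="$dir/spamassassin-$(date +%F).bak"
    sa-learn --backup > "$file" || return 1
    # -s is true only if the file exists and is not empty.
    [ -s "$file" ] || return 1
    printf '%s\n' "$file"
}

# Usage:
# backup_bayes /var/backup
```

The printed path can then be passed to sa-learn --restore when the database has to be rebuilt.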

Cron job & scripting

Improve the performance of a training script with the --no-sync and --sync options of sa-learn. If you run several sa-learn commands, writing to the journal first and syncing it with the database once at the end is faster. For instance:

#!/usr/bin/env bash
# Learn from every user's folders without syncing after each run
for folder in /var/mail/*; do
    sa-learn --no-sync --spam "${folder}/spam"
    sa-learn --no-sync --ham "${folder}/ham"
done
# Sync the journal to the database once at the end
sa-learn --sync

In case spam still seems to be getting through, verify that the same user is running spamc and sa-learn.
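
One quick way to check this is to look at who owns the Bayes files. The sketch below assumes the default per-user storage in the .spamassassin folder described earlier; the function name check_bayes_owner is hypothetical.

```shell
#!/usr/bin/env bash
# Report whether bayes_toks exists in directory $1 (default: ~/.spamassassin)
# and is owned by the user running this script.
check_bayes_owner() {
    local toks
    toks="${1:-$HOME/.spamassassin}/bayes_toks"
    if [ ! -e "$toks" ]; then
        echo "missing: $toks"
        return 1
    fi
    # -O is true if the file is owned by the effective user ID.
    if [ ! -O "$toks" ]; then
        echo "not owned by $(id -un): $toks"
        return 1
    fi
    echo "ok: $toks"
}

# Usage: run this as the account that starts spamc
# check_bayes_owner
```

A "missing" result for the account that starts spamc suggests that sa-learn has been training a different user's database.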

[Looking for help? Try our Server Management Services.]

Conclusion

Thus we saw how to train SpamAssassin to catch spam better, with help from the experts at Bobcares. Our Support Engineers are well-versed in the different aspects of Server Management and more.
