Filter out bot visits from Gunicorn log

Python web applications are often served in production by Gunicorn, and its log is a useful resource when investigating incidents. Visits from bots, however, make that log very noisy. How can we exclude them?

A Gunicorn deployment usually comes with a config file, often named gunicorn_conf.py, with content like this:

proc_name = 'awesome-web'
workers = 6
worker_tmp_dir = '/dev/shm/'

# Keep log lines short. The timestamp and process ID are dropped because journalctl already shows them.
logconfig_dict = {
    'formatters': {
        'generic': {
            'format': '[%(levelname)s] %(message)s',
        }
    },
    'loggers': {
        'gunicorn.error': {
            'level': 'INFO',
            'handlers': ['error_console'],
            'propagate': False,
            'qualname': 'gunicorn.error',
        },
        'gunicorn.access': {
            'level': 'INFO',
            'handlers': ['console'],
            'propagate': False,
            'qualname': 'gunicorn.access',
        },
    },
}
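
To see what the shortened format produces, here is a small sketch that uses only the standard logging module. The log message and the pid value are made up for illustration; only the format string comes from the config above.

```python
import logging

# Same format string as the 'generic' formatter in the config above.
formatter = logging.Formatter('[%(levelname)s] %(message)s')

# A hand-built record standing in for one emitted by Gunicorn (contents are invented).
record = logging.LogRecord(
    name='gunicorn.error',
    level=logging.INFO,
    pathname='demo.py',
    lineno=0,
    msg='Booting worker with pid: %s',
    args=(1234,),
    exc_info=None,
)

line = formatter.format(record)
print(line)  # [INFO] Booting worker with pid: 1234
```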

To tell Gunicorn not to log bot visits, we will manipulate Gunicorn's logger object. First, define a function to identify bots (search engine bots and crawlers) and a logging filter class:

import logging
from logging import LogRecord


def is_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known bot identifier."""
    bot_ids = (
        'SemrushBot',
        'DataForSeoBot',
        'bingbot',
        'YandexBot',
        'AhrefsBot',
        'DotBot',
        'PetalBot',
        'EzLynx',
        'Googlebot',
        'Amazonbot',
        'MJ12bot',
        'Sogou web spider',
    )
    return any(s in user_agent for s in bot_ids)


class BotIgnoreFilter(logging.Filter):
    def filter(self, record: LogRecord) -> bool:
        passed = super().filter(record)
        # For access log records, record.args is a dict of "atoms"; 'a' is the user agent.
        # Ref: https://docs.gunicorn.org/en/stable/settings.html#access-log-format
        user_agent = record.args['a']
        from_bot = is_bot(user_agent)
        return passed and not from_bot
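
The filter can be tried outside Gunicorn with a plain logger. The sketch below is only an approximation: the logger name 'demo.access', the ListHandler helper, the simplified format string and the sample user agents are all made up, and the bot list is a trimmed copy of the one above.

```python
import logging


def is_bot(user_agent: str) -> bool:
    # Trimmed copy of the bot list, just enough for the demo.
    bot_ids = ('bingbot', 'Googlebot', 'AhrefsBot')
    return any(s in user_agent for s in bot_ids)


class BotIgnoreFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # The log arguments are a dict; 'a' holds the user agent.
        return super().filter(record) and not is_bot(record.args['a'])


class ListHandler(logging.Handler):
    """Hypothetical helper: collects rendered messages in memory."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(record.getMessage())


# 'demo.access' is a made-up name standing in for 'gunicorn.access'.
logger = logging.getLogger('demo.access')
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)
logger.addFilter(BotIgnoreFilter())

# Simplified format with two atoms: request line and user agent.
fmt = '"%(r)s" "%(a)s"'
logger.info(fmt, {'r': 'GET / HTTP/1.1',
                  'a': 'Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0'})
logger.info(fmt, {'r': 'GET /robots.txt HTTP/1.1',
                  'a': 'Mozilla/5.0 (compatible; bingbot/2.0)'})

# Only the browser visit should have reached the handler.
print(handler.lines)
```

Passing a single dict as the log argument mirrors how the access log format refers to its fields by name, which is why record.args can be indexed with 'a' inside the filter.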

Then, in the same config file, we attach the filter to Gunicorn's access logger from the on_starting hook:

def on_starting(server):
    server.log.access_log.addFilter(BotIgnoreFilter())
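
A quick way to check the hook without starting Gunicorn is to call it with a stand-in object. This is only a sketch: fake_server and the logger name 'demo.hook.access' are invented, the stand-in models nothing beyond the server.log.access_log attribute path that the hook touches, and the filter class here is an empty placeholder.

```python
import logging
from types import SimpleNamespace


class BotIgnoreFilter(logging.Filter):
    """Placeholder with the same name as the filter above; the body is omitted."""


def on_starting(server):
    server.log.access_log.addFilter(BotIgnoreFilter())


# Stand-in for the object Gunicorn passes to the hook (an assumption for this demo).
fake_server = SimpleNamespace(
    log=SimpleNamespace(access_log=logging.getLogger('demo.hook.access'))
)
on_starting(fake_server)

# The logger should now carry one BotIgnoreFilter instance.
attached = fake_server.log.access_log.filters
print([type(f).__name__ for f in attached])
```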

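Done. If Gunicorn is managed by systemd, you can use systemctl to tell Gunicorn to re-read the new config (assuming your systemd unit file is our-web.service). If the filter does not seem to take effect after a reload, restart the service instead, since on_starting runs only when the Gunicorn master process starts: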

$ sudo systemctl reload our-web.service

Gunicorn's use of a Python script for configuration looks weird at first. But in some situations, like this one, it is an advantage.