[Bug 278424] deskutils/py-paperless-ngx: man page doesn't mention NLTK's Snowball Stemmer

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 18 Apr 2024 07:21:48 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278424

            Bug ID: 278424
           Summary: deskutils/py-paperless-ngx: man page doesn't mention
                    NLTK's Snowball Stemmer
           Product: Ports & Packages
           Version: Latest
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: Individual Port(s)
          Assignee: grembo@FreeBSD.org
          Reporter: freebsd.bugzilla@mail.tinsuke.com
             Flags: maintainer-feedback?(grembo@FreeBSD.org)

The man page states, about setting up NLTK:

> NLTK DATA
>     In order to process scanned documents using machine learning, paperless-
>     ngx requires NLTK (natural language toolkit) data.  The required files
>     can be downloaded by using these commands:
>
>           /usr/local/bin/python3.9 -m nltk.downloader \
>             stopwords punkt -d /var/db/paperless/nltkdata

The command omits the "snowball_data" package. The project's documentation
(https://docs.paperless-ngx.com/setup/#bare_metal) lists it among the data to
download:

> Optional: If using the NLTK machine learning processing (see PAPERLESS_ENABLE_NLTK for details), download the NLTK data for the Snowball Stemmer, Stopwords and Punkt tokenizer to your PAPERLESS_DATA_DIR/nltk. Refer to the NLTK instructions for details on how to download the data.

I can't vouch for how much the Snowball Stemmer data helps in practice, but
its description makes it sound very useful
(https://github.com/snowballstem/snowball?tab=readme-ov-file#what-is-stemming):

> What is Stemming?
> Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.

I suggest adding "snowball_data" to the man page's sample NLTK download
command so that it matches the project's docs and is useful to users of this
port (thanks for it, btw!).
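
A possible revision of the sample command, as a sketch only: it reuses the
interpreter path and data directory from the man page excerpt quoted above and
adds "snowball_data" to the package list. The variable names PYTHON, NLTK_DIR
and NLTK_PACKAGES are mine, for illustration, and are not part of the port. The
command is echoed rather than executed, so drop the leading "echo" to actually
download the data:

```shell
# Assumption: interpreter path and data directory are taken verbatim from the
# man page excerpt; adjust them to the installed Python version and data dir.
PYTHON=/usr/local/bin/python3.9
NLTK_DIR=/var/db/paperless/nltkdata

# Package IDs named in the paperless-ngx docs: Snowball Stemmer data,
# stopwords, and the Punkt tokenizer.
NLTK_PACKAGES="snowball_data stopwords punkt"

# Dry run: prints the command instead of running it.
# $NLTK_PACKAGES is intentionally unquoted so each ID becomes its own argument.
echo "$PYTHON" -m nltk.downloader $NLTK_PACKAGES -d "$NLTK_DIR"
```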
