[Bug 278424] deskutils/py-paperless-ngx: man page doesn't mention NLTK's Snowball Stemmer
Date: Thu, 18 Apr 2024 07:21:48 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278424

            Bug ID: 278424
           Summary: deskutils/py-paperless-ngx: man page doesn't mention
                    NLTK's Snowball Stemmer
           Product: Ports & Packages
           Version: Latest
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: Individual Port(s)
          Assignee: grembo@FreeBSD.org
          Reporter: freebsd.bugzilla@mail.tinsuke.com
             Flags: maintainer-feedback?(grembo@FreeBSD.org)

The man page states, about setting up NLTK:

> NLTK DATA
>      In order to process scanned documents using machine learning, paperless-
>      ngx requires NLTK (natural language toolkit) data.  The required files
>      can be downloaded by using these commands:
>
>            /usr/local/bin/python3.9 -m nltk.downloader \
>              stopwords punkt -d /var/db/paperless/nltkdata

It is missing the "snowball_data" file to be downloaded. The file is referred
to in the project's doc (https://docs.paperless-ngx.com/setup/#bare_metal):

> Optional: If using the NLTK machine learning processing (see
> PAPERLESS_ENABLE_NLTK for details), download the NLTK data for the Snowball
> Stemmer, Stopwords and Punkt tokenizer to your PAPERLESS_DATA_DIR/nltk.
> Refer to the NLTK instructions for details on how to download the data.

I can't vouch for how handy it is to have that in NLTK or not, but it sounds
very useful from its description
(https://github.com/snowballstem/snowball?tab=readme-ov-file#what-is-stemming):

> What is Stemming?
> Stemming maps different forms of the same word to a common "stem" - for
> example, the English stemmer maps connection, connections, connective,
> connected, and connecting to connect. So a search for connected would also
> find documents which only have the other forms.

I suggest "snowball_data" is added to the man page's sample NLTK download
command so it is in line with the project's docs and can be useful to users of
this port (thanks for it, btw!).
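For illustration, the amended download command might look roughly like the sketch below. It assumes that "snowball_data" is the identifier the NLTK downloader expects, and it reuses the interpreter path and data directory from the man page excerpt unchanged. The nltk_download_cmd helper and the "run" argument are hypothetical conveniences for this example, not part of the port.

```shell
#!/bin/sh
# Sketch only: the man page's sample command with snowball_data appended.
# Interpreter path and target directory are copied from the man page excerpt.
PYTHON="/usr/local/bin/python3.9"
NLTK_DIR="/var/db/paperless/nltkdata"
# Assumption: snowball_data is the package name for the Snowball Stemmer data.
NLTK_PACKAGES="stopwords punkt snowball_data"

# Hypothetical helper: prints the full command line without executing it.
nltk_download_cmd() {
    echo "$PYTHON -m nltk.downloader $NLTK_PACKAGES -d $NLTK_DIR"
}

if [ "${1:-}" = "run" ]; then
    # NLTK_PACKAGES is left unquoted on purpose so each name is its own argument.
    "$PYTHON" -m nltk.downloader $NLTK_PACKAGES -d "$NLTK_DIR"
else
    # Default: dry run, only show what would be executed.
    nltk_download_cmd
fi
```

Called without arguments, the script only prints the command; passing "run" executes it, which needs network access and write permission on the target directory.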