ADMINISTRATOR - November 2024 - Apario Reader Configuration
Running your own instance of Apario Reader can be challenging when you don't know how to configure the application properly. This guide will help you win, and win bigly.
In January 2024, Project Apario released the Apario Reader, an open source Go web application, on GitHub. Packaged as an all-in-one application that can be set up in minutes, the Apario Reader gives users advanced configuration options that alter the way the application runs. With limited resources at our disposal, assembling the required documentation for the project is on the honey-do list, and it's still an area that needs a little attention. So I would like to offer you the opportunity now to learn about the higher levels of the Apario Reader configuration and what the project offers to users who wish to run their own 5D OSINT.
The Configurable Package
Created on June 8th, 2023, the configurable package on GitHub is written in Go, has 5 dependencies, and runs on Go 1.23 and above. The Apario Reader implements the configurable package inside its config.go file.
Within the Go ecosystem, there are many amazing and fantastic open source packages available to complete the various tasks needed. However, the act of establishing configuration parameters felt too cumbersome across the marketplace, and so I wanted to offer something easier to use, better in my opinion, and highly extensible. Every package focuses on its own priorities, and the configurable package is no different.
Within Go, a fundamental concept called interfaces is used to provide a contract to end-users of the configurable package. The primary interface in configurable is called IConfigurable. The New() func returns an IConfigurable interface type that exposes a handful of methods.
IConfigurable Interface |
---|
Int(name string) *int |
NewInt(name string, value int, usage string) *int |
Int64(name string) *int64 |
NewInt64(name string, value int64, usage string) *int64 |
Float64(name string) *float64 |
NewFloat64(name string, value float64, usage string) *float64 |
String(name string) *string |
NewString(name, value, usage string) *string |
Bool(name string) *bool |
NewBool(name string, value bool, usage string) *bool |
Duration(name string) *time.Duration |
NewDuration(name string, value time.Duration, usage string) *time.Duration |
List(name string) *[]string |
NewList(name string, value []string, usage string) *[]string |
Map(name string) *map[string]string |
NewMap(name string, value map[string]string, usage string) *map[string]string |
LoadFile(filename string) error |
Parse(filename string) error |
Usage() string |
And when using it:
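package main

import (
	"fmt"

	"github.com/andreimerlescu/configurable"
)

func main() {
	config := configurable.New()
	// Register a bool configurable named "debug" that defaults to false.
	debug := config.NewBool("debug", false, "Enable debug mode")
	// Load values from config.yaml; Parse returns an error when that fails.
	err := config.Parse("config.yaml")
	if err != nil {
		panic(err)
	}
	// Dereference the pointer to read the resolved value.
	fmt.Println("Debug mode:", *debug)
}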
Go Playground: https://go.dev/play/p/YefJDFtlAPR
This package is used within the Apario Reader. It lets you take a variable, such as the example debug above, and dereference it with *debug to retrieve the value of the configurable. The Parse() func accepts the name of a file whose contents you wish to load into your configuration. Values can be assigned to your configurable variables using multiple techniques. These techniques become available when you create new configurable variables using the IConfigurable methods NewInt, NewInt64, NewFloat64, NewString, NewBool, NewDuration, NewList, and NewMap. The arguments for these functions all follow the same pattern, which makes them easy to remember and use in the future. The type follows New in the function name itself. The first argument is always a string, the second argument is the same type as the func's name suggests (NewFloat64 requires argument 2 to be float64), and the third argument is the description of the flag.
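The name, value, usage argument order and the pointer return value will look familiar if you have used Go's standard flag package. As a rough analogy only, the sketch below uses the standard library rather than configurable itself; the flag names workers, ratio and timeout are made up for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// parseArgs registers three hypothetical flags and parses the given arguments.
// Each constructor takes (name, default value, usage) and returns a pointer.
func parseArgs(args []string) (int, float64, time.Duration, error) {
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	workers := fs.Int("workers", 4, "Number of workers")
	ratio := fs.Float64("ratio", 0.71, "Match ratio")
	timeout := fs.Duration("timeout", 30*time.Second, "Request timeout")
	if err := fs.Parse(args); err != nil {
		return 0, 0, 0, err
	}
	// Dereference the pointers only after Parse has run.
	return *workers, *ratio, *timeout, nil
}

func main() {
	w, r, t, err := parseArgs([]string{"--workers", "8", "--timeout", "5s"})
	if err != nil {
		panic(err)
	}
	fmt.Println(w, r, t)
}
```

Flags that are not passed keep their default, which is why ratio stays at 0.71 in this run.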
Within each of the New methods is a special set of logic that takes the first argument, the name of the configurable, and allows it to be found via environment variables or by using --<the name here> in the CLI when you run go run . for your project. This means that for the example debug argument assigned to the variable debug, the user of the application you're building could run go run . --debug true or DEBUG=true go run . or export DEBUG=true; go run . instead. Any of these is available to you, and if you're leveraging .Parse(filePath) then you can also create a .ini, .json, or .yaml file. So, in the example, config.yaml would look something like:
---
debug: true
Then, when the application is executed again with go run ., the variable debug gets the value true assigned to it, because it's been defined in the config.yaml file and loaded into the runtime of the application using .Parse(filePath).
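Throughout this guide each flag is paired with an environment variable: --debug with DEBUG, --product-name with PRODUCT_NAME. The sketch below shows one way that naming rule could be expressed. It is an illustration of the pairing, not the configurable package's actual code, and the helper names envName and lookup are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// envName converts a flag name such as "product-name" into the matching
// environment variable name, "PRODUCT_NAME".
func envName(flagName string) string {
	return strings.ToUpper(strings.ReplaceAll(flagName, "-", "_"))
}

// lookup returns the environment value for a flag name, or def when unset.
func lookup(flagName, def string) string {
	if v, ok := os.LookupEnv(envName(flagName)); ok {
		return v
	}
	return def
}

func main() {
	os.Setenv("DEBUG", "true")
	fmt.Println(envName("production-environment-label"))
	fmt.Println(lookup("debug", "false"))
}
```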
The package configurable also allows you to mix and match definitions of various arguments. For example, if you're using debug, then you may not want to encourage or document the --debug option everywhere, but rather allow users to establish system level globals using environment variables instead. This means that some parts of your configuration can come from environment variables, some can come from encouraged CLI arguments directly, and others that are more fixed for long-term use can be stored inside configuration files.
Apario Reader Configuration Groups
With 117+ configurables offered within the Apario Reader application, there are 6 categories of configuration that control the runtime.
- Global Settings
- Database Settings
- Appliance Settings
- Encryption Settings
- Advanced Search Settings
- Application Settings
Global Settings
--product-name ""
or PRODUCT_NAME=""
Default: apario-reader
--environment
or ENVIRONMENT=
Default: development
When you're ready to run a live instance, set this to production.
--production-environment-label
or PRODUCTION_ENVIRONMENT_LABEL=
Default: production
When used, this overrides the label that --environment must match for the instance to be treated as true production. When ignored, production is the tag required for --environment to set the release mode of the application to production mode. When --production-environment-label="staging" is set, the use of --environment="production" will mean that the host will not be considered in production mode. If you set --production-environment-label="honey" and --environment="honey", then the application runs in production mode, tagged as honey.
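The rule described above boils down to a string comparison between the two values. A minimal sketch, assuming nothing else feeds into the decision; the function name isProduction is hypothetical.

```go
package main

import "fmt"

// isProduction reports whether the instance should run in production mode:
// the --environment value must equal the --production-environment-label value.
func isProduction(environment, productionLabel string) bool {
	return environment == productionLabel
}

func main() {
	fmt.Println(isProduction("production", "production")) // true
	fmt.Println(isProduction("production", "staging"))    // false
	fmt.Println(isProduction("honey", "honey"))           // true
}
```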
--info-log
or INFO_LOG=
Default: <nothing>
When used, this is the path to the primary application's log, not the web traffic log. This file should be subject to log rotation, as it can become a single point of failure on an instance of apario-reader that is left running for months at a time unattended.
--config
or CONFIG=
Default: config.yaml
When used, this is the filename of the *.yaml config file that the application depends on for its runtime information when the full set of command line arguments is not passed to the apario-reader binary.
--enable-ping
or ENABLE_PING=
Default: false
When set to true, a new route, /ping, is exposed for any HTTP verb (GET, POST, PUT, PATCH, DELETE, and so on) and responds with PONG.
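To picture what this flag toggles, here is a sketch using only net/http from the standard library. The Apario Reader itself uses Gin, so this is not its routing code; newMux and probe are hypothetical helpers, and probe exercises the handler in memory so no port needs to be bound.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pingHandler answers every HTTP method with PONG.
func pingHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, "PONG")
}

// newMux registers /ping only when enablePing is true.
func newMux(enablePing bool) *http.ServeMux {
	mux := http.NewServeMux()
	if enablePing {
		mux.HandleFunc("/ping", pingHandler)
	}
	return mux
}

// probe sends one in-memory request and returns the status code and body.
func probe(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	for _, m := range []string{"GET", "POST", "DELETE"} {
		code, body := probe(newMux(true), m, "/ping")
		fmt.Println(m, code, body)
	}
}
```

With the flag off, the same request falls through to the mux's default 404 response.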
--site-title
or SITE_TITLE=
Default: idoread.com
This is the name of the website that will appear all throughout the interface of Apario Reader.
--site-company
or SITE_COMPANY=
Default: Project Apario LLC
This is the name of the organization or company that is operating the instance of Apario Reader for the world. Put your name or group here. If you're a podcast making an instance of apario-reader for your community, enter your podcast name here.
--primary-domain
or PRIMARY_DOMAIN=
Default: <undefined>
When used, this will assign the domain name of the instance to whatever value you place here. When configuring HTTPS, you'll need to ensure that the Certificate is issued for this domain name.
--decimal-symbol
or DECIMAL_SYMBOL=
Default: .
When defined, this allows you to override the locale settings of the instance of apario-reader, replacing the . decimal separator in numbers with whatever value you place here. With --decimal-symbol="," the value $10.99 becomes $10,99 when rendered.
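The effect on rendering can be sketched in a few lines. This is an illustration of the substitution only, under the assumption that amounts are shown with two decimals; formatAmount is a hypothetical helper, not a function from apario-reader.

```go
package main

import (
	"fmt"
	"strings"
)

// formatAmount renders a value with two decimals and swaps the "." for the
// configured decimal symbol.
func formatAmount(value float64, decimalSymbol string) string {
	return strings.Replace(fmt.Sprintf("%.2f", value), ".", decimalSymbol, 1)
}

func main() {
	fmt.Println("$" + formatAmount(10.99, ".")) // $10.99
	fmt.Println("$" + formatAmount(10.99, ",")) // $10,99
}
```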
Database Settings
--database
or DATABASE=
Default: <undefined>
This is the path to your apario-writer output destination. This should be a simple directory like /apario/database where the subdirectories inside are the SHA512 checksum of the source URL of the document. It is recommended to keep this on its own drive, separate from the drive that the logs are written to. This drive will most likely be very large and will serve the content you require. Keep in mind the speed of your disks too. For SAS drives connected over the network, you'll consistently see 500MB/s throughput; for SSD drives you'll see closer to 600MB/s, and for NVMe drives closer to 1700MB/s. For 5400 RPM HDD drives (non-enterprise grade), you'll see 120MB/s throughput and about half the life expectancy of an enterprise grade HDD. For a 7200 RPM enterprise grade HDD, you'll see performance around 220MB/s throughput. Given that thumbnails are 1/8 MB each, 220MB/s can serve up to 1760 thumbnails/second. Given that full size pages served through StumbleInto are around 7/8 MB each, the same drive can serve roughly 250 StumbleInto/second. For NVMe drives, you're looking at 13,600 thumbnails/second or 1,942 StumbleInto/second.
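The per-second figures above are disk throughput divided by item size. A small sketch of that arithmetic, so you can plug in your own drive's numbers; itemsPerSecond is a hypothetical helper and the result is a ceiling that ignores seek time and network limits.

```go
package main

import "fmt"

// itemsPerSecond estimates how many items of itemMB megabytes a disk with
// the given sequential throughput (MB/s) can serve each second.
func itemsPerSecond(throughputMBps, itemMB float64) int {
	return int(throughputMBps / itemMB)
}

func main() {
	const thumbnailMB, pageMB = 1.0 / 8, 7.0 / 8
	fmt.Println(itemsPerSecond(220, thumbnailMB))  // 1760
	fmt.Println(itemsPerSecond(220, pageMB))       // 251
	fmt.Println(itemsPerSecond(1700, thumbnailMB)) // 13600
	fmt.Println(itemsPerSecond(1700, pageMB))      // 1942
}
```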
--persist-runtime-database
or PERSIST_RUNTIME_DATABASE=
Default: true
When true, this will create an app.db directory of .json files that contain the in-memory data required for full-text search. Setting this to true saves the index database to disk once it has been built, which offloads the initial boot time of later starts. On its own, this flag will not load the saved database; loading is handled by the next configurable.
--load-persistent-database
or LOAD_PERSISTENT_DATABASE=
Default: true
When true, this will read the app.db if it's populated and load it into memory. When combined with --persist-runtime-database, this speeds up boot time of the application considerably.
--flush-database-watch-file
or FLUSH_DATABASE_WATCH_FILE=
Default: <undefined>
When defined, a goroutine in the application watches the path specified in this configurable. When that file is created on the system, it triggers the database to flush itself and rebuild its index. This is a resource intensive task that will place users coming into your site in a waiting room until the migration has completed.
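One simple way to watch for a trigger file is to poll for it. The sketch below assumes polling with os.Stat; the real application may watch the path differently, and the file name flush-db.sketch is made up for the demo.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForFile polls path every interval until it exists or timeout elapses.
// It reports whether the file appeared.
func waitForFile(path string, interval, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	path := filepath.Join(os.TempDir(), "flush-db.sketch") // hypothetical watch file
	defer os.Remove(path)
	go func() {
		// Simulate an administrator creating the file a moment later.
		time.Sleep(50 * time.Millisecond)
		_ = os.WriteFile(path, nil, 0o644)
	}()
	if waitForFile(path, 10*time.Millisecond, time.Second) {
		fmt.Println("watch file found: flush and rebuild the index here")
	}
}
```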
--users-database-path
or USERS_DATABASE_PATH=
Default: <undefined>
When defined, a directory called users.db is created that stores *.json files for all user accounts on your instance. Their cached Apario Identity public profile will be here. All users who connect to your instance are here.
--snippets-database-path
or SNIPPETS_DATABASE_PATH=
Default: <undefined>
When defined, a directory called snippets.db is created that stores *.json files alongside rendered assets for all snippets created on the instance. Snippet metadata is submitted to SnipThat.info, but the snippet data itself is stored on the instance of apario-reader hosting that OSINT.
--tag-database-path
or TAG_DATABASE_PATH=
Default: <undefined>
When defined, a directory called tags.db is created that stores *.json files for tags. Each tag is a file. Each tag contains metadata and properties that are visible throughout the apario-reader application. The tag database path is where Gematria is stored for tags and where tag relationships are established. A tag is not just a single word; it's a word and a type combined together to mean something entirely new. Therefore, tags have relationships to other tags, and tags are one way to make content discoverable. Tagging a document requires Reputation. Tagged documents are stored in this database directory.
--database-concurrent-write-semaphore
or DATABASE_CONCURRENT_WRITE_SEMAPHORE=
Default: 1
When defined, a semaphore will be used to limit the number of concurrent write operations permitted on the apario-writer database database.db. This isn't the app.db or users.db or tags.db or snippets.db; it's the OSINT database itself. This semaphore controls how documents are changed through the Proposal system and what level of control you want to have over your instance. For instances that I maintain, I set this to 1 so I know for sure that nothing is changing the database en-masse.
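A common Go idiom for this kind of write limit is a buffered channel used as a semaphore. The sketch below is that generic idiom, not the Apario Reader's implementation; runWrites is a hypothetical function that simulates writes and reports the highest concurrency it observed.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWrites runs n simulated writes while allowing at most limit at a time,
// and returns the highest number of writes observed running concurrently.
func runWrites(n, limit int) int {
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	var active, peak int32
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot, blocking when all are taken
			defer func() { <-sem }() // release the slot when the write finishes
			cur := atomic.AddInt32(&active, 1)
			for { // record the highest concurrency seen so far
				old := atomic.LoadInt32(&peak)
				if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
					break
				}
			}
			atomic.AddInt32(&active, -1)
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt32(&peak))
}

func main() {
	fmt.Println(runWrites(20, 1)) // 1
}
```

With a limit of 1, writes are fully serialized, which matches the cautious default described above.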
--textee-database
or TEXTEE_DATABASE=
Default: <undefined>
When defined, a directory called textee.db will be created that keeps track of all substring vs. gematria index results that are used for partial full-text search.
Appliance Settings
These settings change the "appliance" state of the machine running apario-reader. Since the application runs on bare metal hardware and is designed to use 100% of the system it has available to it, it's called an appliance and not an application. For that reason, these settings have a broader impact on the host machine running apario-reader than on the clients connecting to your instance reading OSINT.
Encryption Settings
--tls-public-key
or TLS_PUBLIC_KEY=
Default: ""
Expected to be a path to a PEM formatted SSL certificate and CA bundle, the TLS Public Key configurable is a string type variable. It requires read permissions on the path, and the file must contain the Certificate + CA Bundle concatenated together.
--tls-private-key
or TLS_PRIVATE_KEY=
Default: ""
Expected to be a path to a PEM formatted SSL private key, the TLS Private Key configurable is a string type variable. It requires read permissions on the path, and the file must contain the private key that matches the --tls-public-key configurable.
--tls-private-key-password
or TLS_PRIVATE_KEY_PASSWORD=
Default: ""
The decryption password for the TLS private key (if applicable).
--auto-tls
or AUTO_TLS=
Default: false
When set to true, apario-reader will issue a self-signed SSL certificate to bind your instance to HTTPS connections.
--tls-life-min
or TLS_LIFE_MIN=
Default: ""
When defined, a duration in minutes set as an integer defines the life of the TLS certificate validity in memory before being re-checked.
--tls-expires-in
or TLS_EXPIRES_IN=
Default: ""
When defined, this will be the expiration of the TLS certificate that is issued in a self-signed manner.
--tls-company
or TLS_COMPANY=
Default: ""
When defined, this is the name that appears in the SSL Certificate details.
--tls-san-ip
or TLS_SAN_IP=
Default: ""
When defined, the self-signed SSL Certificate that apario-reader generates for HTTPS can have SAN IP addresses attached to it, so the certificate can be used elsewhere throughout your services. IP bound SSL certificates offer more security. When using the --auto-tls functionality within apario-reader, set --tls-san-ip to the IP address of the host that is running the instance (you can run curl -L https://ip4.washere.dev to get the IPv4 address of your WAN). Defining this IP address here allows the bind(":443") to attach only to the IP of your apario-reader instance. This protects your users and yourself from potential issues caused by using self signed certificates.
--tls-additional-domains
or TLS_ADDITIONAL_DOMAINS=
Default: ""
When defined, additional domain names will be secured using --auto-tls. For example, when a mydomain1.com DNS A record and a mydomain2.com DNS A record both point to the same instance of apario-reader, a multi-domain SSL certificate lets your single apario-reader process serve your OSINT content on both mydomain1.com and mydomain2.com. Setting these additional domains allows you to use a self-signed certificate that names every domain you wish to use. If your users are okay with accepting the risks associated with self-signed certificates, you can secure your instance of Apario Reader with multiple IP addresses and multiple domains, and do it in a way where you control the expiration.
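To make the company name, expiration, SAN IP and additional domain options more concrete, here is a sketch of creating a self-signed certificate with Go's standard crypto/x509 package. It shows the general technique only and is not the Apario Reader's own certificate code; the function selfSigned, the organization name and the 203.0.113.10 address are all placeholders.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// selfSigned creates a self-signed certificate for the given organization,
// DNS names and IP addresses, valid for the given lifetime.
func selfSigned(company string, domains, ips []string, life time.Duration) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(now.UnixNano()), // good enough for a sketch
		Subject:               pkix.Name{Organization: []string{company}},
		NotBefore:             now,
		NotAfter:              now.Add(life),
		KeyUsage:              x509.KeyUsageDigitalSignature,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		DNSNames:              domains, // SAN DNS entries
	}
	for _, ip := range ips {
		if parsed := net.ParseIP(ip); parsed != nil {
			tmpl.IPAddresses = append(tmpl.IPAddresses, parsed) // SAN IP entries
		}
	}
	// Template and parent are the same certificate, which makes it self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := selfSigned("Example Org",
		[]string{"mydomain1.com", "mydomain2.com"},
		[]string{"203.0.113.10"}, 24*time.Hour)
	if err != nil {
		panic(err)
	}
	fmt.Println(cert.Subject.Organization, cert.DNSNames, cert.IPAddresses)
}
```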
--session-secret
or SESSION_SECRET=
Default: ""
When defined, sessions are encrypted using this secret. Otherwise a random string is used, which can cause issues in production between deployments. Setting this value assures that connected sessions retain their data between deployments of apario-reader, because they can still decrypt their active session after the apario-reader process restarts.
--session-authenticity-token-secret
or SESSION_AUTHENTICITY_TOKEN_SECRET=
Default: ""
When defined, the csrf authenticity token is cryptographically generated using this secret. Thus, when you deploy instances of apario-reader in the wild and are upgrading your instance, the csrf token will remain valid between instance restarts; otherwise, this value assumes that functionality.
Advanced Search Settings
--concurrent-searches
or CONCURRENT_SEARCHES=
Default: 30
Advanced Search is built into all instances of apario-reader, and its searches are considered resource-intensive tasks. As such, you need to define a limit on the number of searches that you wish for apario-reader to perform concurrently. When this limit is reached, the user's connection to the page hangs until space opens up for their search. When new users enter the full search, they are presented with the waiting room offering them a place in line.
--search-algorithm
or SEARCH_ALGORITHM=
Default: jaro_winkler
I won't search for things you can search for yourself, but Jaro Winkler is the algorithm enabled by default to perform searches. The available algorithms are jaro, jaro_winkler, wagner_fisher, ukkonen and hamming. Each of these algorithms will slightly tweak the results of Advanced Search since they are the means that apario-reader uses to correct for the 71% accuracy of the OSINT OCR (optical character recognition) output from Tesseract. With these algorithms enabled, a search for kenedy will match kennedy, thus allowing for OCR errors and user errors with a 20% grace given to you.
--search-concurrent-buffer
or SEARCH_CONCURRENT_BUFFER=
Default: 369
These are concurrency channels used for processing search results. Tune this if your server is slow or you want to increase throughput since you're not seeing much utilization on your box despite users on it.
--search-concurrency-limiter
or SEARCH_CONCURRENCY_LIMITER=
Default: 9
This is also used for concurrency channels to process search results. Again, tuning this number helps you. This number isn't for the buffer of results; it's for the concurrency limiter of the results processor.
--search-timeout-seconds
or SEARCH_TIMEOUT_SECONDS=
Default: 30 (seconds)
How many seconds you allow Advanced Search to run for before terminating the connection request to the user and delivering the results as-is to them.
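Delivering results as-is when the clock runs out can be sketched with a context deadline and a select loop. This is a generic pattern under the assumption that results arrive on a channel; collect and the page names are hypothetical, and the timeout is shortened to milliseconds for the demo.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collect gathers results until the channel closes or the timeout expires,
// then returns whatever has arrived so far.
func collect(results <-chan string, timeout time.Duration) []string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	var found []string
	for {
		select {
		case r, ok := <-results:
			if !ok {
				return found // the search finished before the deadline
			}
			found = append(found, r)
		case <-ctx.Done():
			return found // deadline hit: deliver the results as-is
		}
	}
}

func main() {
	results := make(chan string, 3)
	results <- "page-1"
	results <- "page-2"
	// The channel is never closed, simulating a search still in progress.
	fmt.Println(collect(results, 50*time.Millisecond))
}
```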
Search Algorithm: Jaro
--search-threshold-jaro
or SEARCH_THRESHOLD_JARO=
Default: 0.71
This threshold value is used to adjust the Jaro algorithm only. A 0.71 value represents a comfortable balance for the % of characters that must be valid to match. Too high a value lowers the grace allowed for OCR errors and user typos in search keywords. Too low a value and you see matches for apple and car because they both contain an a.
Search Algorithm: Jaro Winkler
--search-threshold-jaro-winkler
or SEARCH_THRESHOLD_JARO_WINKLER=
Default: 0.71
--search-jaro-winkler-boost-threshold
or SEARCH_JARO_WINKLER_BOOST_THRESHOLD=
Default: 0.7
--search-jaro-winkler-prefix-size
or SEARCH_JARO_WINKLER_PREFIX_SIZE=
Default: 3
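To see how the threshold, boost threshold and prefix size interact, here is a textbook Jaro and Jaro-Winkler sketch. It is not the Apario Reader's search code. It assumes the boost threshold is the Jaro score above which the prefix bonus applies, and it uses the conventional 0.1 scaling factor, which is not one of the configurables listed here.

```go
package main

import "fmt"

// jaro returns the Jaro similarity of a and b in the range 0..1.
func jaro(a, b string) float64 {
	r1, r2 := []rune(a), []rune(b)
	if len(r1) == 0 || len(r2) == 0 {
		if len(r1) == len(r2) {
			return 1
		}
		return 0
	}
	longest := len(r1)
	if len(r2) > longest {
		longest = len(r2)
	}
	window := longest/2 - 1 // how far apart two matching characters may be
	if window < 0 {
		window = 0
	}
	m1, m2 := make([]bool, len(r1)), make([]bool, len(r2))
	matches := 0
	for i := range r1 {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(r2) {
			hi = len(r2)
		}
		for j := lo; j < hi; j++ {
			if !m2[j] && r1[i] == r2[j] {
				m1[i], m2[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched characters that appear in a different order.
	outOfOrder, j := 0, 0
	for i := range r1 {
		if !m1[i] {
			continue
		}
		for !m2[j] {
			j++
		}
		if r1[i] != r2[j] {
			outOfOrder++
		}
		j++
	}
	m := float64(matches)
	return (m/float64(len(r1)) + m/float64(len(r2)) + (m-float64(outOfOrder)/2)/m) / 3
}

// jaroWinkler adds a bonus for a shared prefix when the Jaro score already
// exceeds boostThreshold. prefixSize caps how many leading characters count.
func jaroWinkler(a, b string, boostThreshold float64, prefixSize int) float64 {
	score := jaro(a, b)
	if score <= boostThreshold {
		return score
	}
	r1, r2 := []rune(a), []rune(b)
	p := 0
	for p < prefixSize && p < len(r1) && p < len(r2) && r1[p] == r2[p] {
		p++
	}
	return score + float64(p)*0.1*(1-score)
}

func main() {
	fmt.Printf("%.3f\n", jaroWinkler("kenedy", "kennedy", 0.7, 3))
	fmt.Printf("%.3f\n", jaro("apple", "car"))
}
```

Under these assumptions kenedy and kennedy score well above the 0.71 default, while apple and car land near 0.51, which is why a threshold set too low starts matching unrelated words.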
The next three search algorithms, Ukkonen, Wagner Fischer, and Hamming, each provide options to adjust operation costs and constraints. For both Ukkonen and Wagner Fischer, you can configure the costs of insertion, substitution, and deletion operations (defaulting to 1 for insertion and deletion, and 2 for substitution), alongside a limit on the maximum number of allowable substitutions, which defaults to 2. The Hamming algorithm is simpler, offering only a configuration for the maximum number of substitutions, also defaulting to 2. These settings let you tune each algorithm toward performance or accuracy for your use case.
Search Algorithm: Ukkonen
--search-ukkonen-icost
or SEARCH_UKKONEN_ICOST=
Default: 1
--search-ukkonen-scost
or SEARCH_UKKONEN_SCOST=
Default: 2
--search-ukkonen-dcost
or SEARCH_UKKONEN_DCOST=
Default: 1
--search-ukkonen-max-substitutions
or SEARCH_UKKONEN_MAX_SUBSTITUTIONS=
Default: 2
Search Algorithm: Wagner Fischer
--search-wagner-fischer-icost
or SEARCH_WAGNER_FISCHER_ICOST=
Default: 1
--search-wagner-fischer-scost
or SEARCH_WAGNER_FISCHER_SCOST=
Default: 2
--search-wagner-fischer-dcost
or SEARCH_WAGNER_FISCHER_DCOST=
Default: 1
--search-wagner-fischer-max-substitutions
or SEARCH_WAGNER_FISCHER_MAX_SUBSTITUTIONS=
Default: 2
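The insertion, substitution and deletion costs plug into the classic Wagner-Fischer dynamic program. The sketch below is the textbook algorithm with configurable costs, not the Apario Reader's code, and it does not model the max-substitutions cap; weightedDistance and min3 are hypothetical names.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// weightedDistance is the Wagner-Fischer dynamic program with separate
// costs for insertion, substitution and deletion when turning a into b.
func weightedDistance(a, b string, icost, scost, dcost int) int {
	r1, r2 := []rune(a), []rune(b)
	prev := make([]int, len(r2)+1)
	for j := range prev {
		prev[j] = j * icost // build b's prefix from an empty string
	}
	for i := 1; i <= len(r1); i++ {
		cur := make([]int, len(r2)+1)
		cur[0] = i * dcost // delete every character of a's prefix
		for j := 1; j <= len(r2); j++ {
			sub := prev[j-1]
			if r1[i-1] != r2[j-1] {
				sub += scost
			}
			cur[j] = min3(sub, prev[j]+dcost, cur[j-1]+icost)
		}
		prev = cur
	}
	return prev[len(r2)]
}

func main() {
	// Defaults from this guide: icost=1, scost=2, dcost=1.
	fmt.Println(weightedDistance("kenedy", "kennedy", 1, 2, 1)) // 1: one insertion
	fmt.Println(weightedDistance("cat", "cut", 1, 2, 1))        // 2: one substitution
}
```

With substitution priced at 2, a substitution costs the same as a deletion plus an insertion, so raising scost further would have no effect on the distance unless the other two costs rise as well.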
Search Algorithm: Hamming
--search-hamming-max-substitutions
or SEARCH_HAMMING_MAX_SUBSTITUTIONS=
Default: 2
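Hamming distance only counts substituted positions between strings of equal length, which is why it has a single setting. A minimal sketch, assuming the max-substitutions value is the cut-off for a match; hamming and withinHamming are hypothetical names.

```go
package main

import "fmt"

// hamming counts the positions where two equal-length strings differ.
// It returns -1 when the lengths differ, where the distance is undefined.
func hamming(a, b string) int {
	r1, r2 := []rune(a), []rune(b)
	if len(r1) != len(r2) {
		return -1
	}
	d := 0
	for i := range r1 {
		if r1[i] != r2[i] {
			d++
		}
	}
	return d
}

// withinHamming reports whether a and b differ in at most maxSubs positions.
func withinHamming(a, b string, maxSubs int) bool {
	d := hamming(a, b)
	return d >= 0 && d <= maxSubs
}

func main() {
	fmt.Println(withinHamming("kennedy", "kemnedy", 2)) // true: one substitution
	fmt.Println(withinHamming("kennedy", "kenedy", 2))  // false: lengths differ
}
```

The second call shows the trade-off: a dropped letter changes the length, so Hamming cannot match it, while the edit-distance algorithms can.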
Application Settings
--access-log
or ACCESS_LOG=
Default: ""
When defined, this is an absolute path to the log file that will contain the Gin web server access log. This file should have log rotation enabled on it.
--gin-log-stdout
or GIN_LOG_STDOUT=
Default: true
When true, Gin, the web server package used in apario-reader, will send its logs to /dev/stdout instead of the file path defined by --access-log.
--unsecure-port
or UNSECURE_PORT=
Default: 8080
The port to bind HTTP connections to.
--secure-port
or SECURE_PORT=
Default: 8443
The port to bind HTTPS connections to.
--force-https
or FORCE_HTTPS=
Default: false
When set to true, connections coming into http://yourdomain.com will be upgraded to https://yourdomain.com automatically.
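An HTTP to HTTPS upgrade is typically a redirect issued by the plain-HTTP listener. The sketch below uses net/http from the standard library rather than Gin, assumes a 301 status, and assumes the standard ports 80 and 443. With the default ports 8080 and 8443 the port in the target would also need rewriting, which is left out here. redirectToHTTPS and probe are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// redirectToHTTPS sends every plain-HTTP request to the same host, path and
// query string over HTTPS.
func redirectToHTTPS(w http.ResponseWriter, r *http.Request) {
	target := "https://" + r.Host + r.URL.RequestURI()
	http.Redirect(w, r, target, http.StatusMovedPermanently)
}

// probe runs one in-memory request through the handler and returns the
// status code and the Location header.
func probe(rawURL string) (int, string) {
	rec := httptest.NewRecorder()
	redirectToHTTPS(rec, httptest.NewRequest("GET", rawURL, nil))
	return rec.Code, rec.Header().Get("Location")
}

func main() {
	code, location := probe("http://yourdomain.com/search?q=kennedy")
	fmt.Println(code, location)
}
```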