ADMINISTRATOR - November 2024 - Apario Reader Configuration

Running your own instance of Apario Reader can be challenging when you don't know how to configure the application properly. This guide will help you win, and win bigly.

Photo by Javier Allegue Barros / Unsplash

In January 2024, Project Apario released the Apario Reader, an OpenSource Go Web Application, on GitHub. Packaged as an all-in-one application that can be set up in minutes, the Apario Reader offers advanced configuration options that alter the way the application runs. With limited resources at our disposal, assembling the required documentation for the project is still on the honey-do list, and it's an area that needs a little attention. So I would like to invite you to learn about the higher levels of the Apario Reader configuration and what the project offers to users who wish to run their own 5D OSINT.

The Configurable Package

GitHub - andreimerlescu/configurable

Created on June 8th, 2023, the configurable package on GitHub is written in Go, has 5 dependencies, and runs on Go 1.23 and above. The Apario Reader's implementation of the configurable package is stored inside its config.go file.

Within the Go ecosystem, there are many fantastic OpenSource packages available for all kinds of tasks; however, establishing configuration parameters felt too cumbersome with what was on offer, so I wanted to provide something easier to use, better in my opinion, and highly extensible. Other packages focus on other priorities, and the configurable package is no different.

Go has a fundamental concept called interfaces, which the configurable package uses to provide a contract to its end-users. The primary interface in configurable is called IConfigurable. The New() func returns an IConfigurable interface type that exposes a handful of methods.

IConfigurable Interface
Int(name string) *int
NewInt(name string, value int, usage string) *int
Int64(name string) *int64
NewInt64(name string, value int64, usage string) *int64
Float64(name string) *float64
NewFloat64(name string, value float64, usage string) *float64
String(name string) *string
NewString(name, value, usage string) *string
Bool(name string) *bool
NewBool(name string, value bool, usage string) *bool
Duration(name string) *time.Duration
NewDuration(name string, value time.Duration, usage string) *time.Duration
List(name string) *[]string
NewList(name string, value []string, usage string) *[]string
Map(name string) *map[string]string
NewMap(name string, value map[string]string, usage string) *map[string]string
LoadFile(filename string) error
Parse(filename string) error
Usage() string

And when using it:

package main

import (
	"fmt"

	"github.com/andreimerlescu/configurable"
)

func main() {
	config := configurable.New()
	debug := config.NewBool("debug", false, "Enable debug mode")
	err := config.Parse("config.yaml")
	if err != nil {
		panic(err)
	}
	fmt.Println("Debug mode:", *debug)
}

Go Playground: https://go.dev/play/p/YefJDFtlAPR

This package is used within the Apario Reader. It lets you take variables, such as the example debug above, and dereference them with *debug to retrieve the value of the configurable. The Parse() func accepts the name of a file whose contents you wish to load into your configuration. Values can be assigned to your configurable variables using multiple techniques, and these techniques become available when you create new configurable variables using the IConfigurable methods NewInt, NewInt64, NewFloat64, NewString, NewBool, NewDuration, NewList, and NewMap. The arguments for these functions all follow the same pattern, which makes them easy to remember and use in the future. The type follows New in the function name itself. The first argument is always a string holding the name, the second argument is a value of the type the func's name suggests (so NewFloat64 requires argument 2 to be a float64), and the third argument is the description of the flag.

Within each of the New methods is a special set of logic that takes the first argument, the name of the configurable, and allows it to be found via Environment Variables or by using --<the name here> in the CLI when you run go run . for your project. This means that, for the example debug argument assigned to the variable debug, the user of the application you're building could run go run . --debug true, or DEBUG=true go run ., or export DEBUG=true; go run . to set it. All of these are available to you, and if you're leveraging .Parse(filePath) then you can also create a .ini, .json, or .yaml file. So, in the example, config.yaml would look something like:

---
debug: true

Then, when the application is executed again with go run ., the variable debug gets the value true assigned to it, because it's been defined in the config.yaml file and loaded into the runtime of the application using .Parse(filePath).

The package configurable also allows you to mix and match where the various arguments are defined. For example, if you're using debug, you may not want to encourage or document the --debug option everywhere, but rather allow users to establish system level globals using environment variables instead. This means that some parts of your configuration can come from environment variables, some from encouraged CLI arguments directly, and others, which are more fixed for long-term use, from configuration files.
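To make the naming convention concrete, here is a minimal sketch of how a flag name lines up with its environment variable throughout this guide (--product-name and PRODUCT_NAME, for example). The envName helper is hypothetical and written only for illustration; it is not part of the configurable package, and the package's own lookup logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// envName converts a CLI flag name such as "product-name" into the
// environment variable spelling used in this guide ("PRODUCT_NAME").
// Hypothetical helper: not part of the configurable package.
func envName(flagName string) string {
	trimmed := strings.TrimLeft(flagName, "-")
	return strings.ToUpper(strings.ReplaceAll(trimmed, "-", "_"))
}

func main() {
	names := []string{"debug", "--product-name", "production-environment-label"}
	for _, name := range names {
		fmt.Printf("--%s <=> %s\n", strings.TrimLeft(name, "-"), envName(name))
	}
}
```

Every configurable listed in the rest of this guide follows that same pairing, so knowing one spelling gives you the other.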

Apario Reader Configuration Groups

With 117+ configurables offered within the Apario Reader application, there are 6 categories of configurations that control the runtime.

  1. Global Settings
  2. Database Settings
  3. Appliance Settings
  4. Encryption Settings
  5. Advanced Search Settings
  6. Application Settings

Global Settings

--product-name "" or PRODUCT_NAME=""

Default: apario-reader

--environment or ENVIRONMENT=

Default: development

When you're ready to run a live instance, set this to production.

--production-environment-label or PRODUCTION_ENVIRONMENT_LABEL=

Default: production

When used, this overrides which --environment value counts as true production. When ignored, production is the value required for --environment to set the release mode of the application to production mode. When --production-environment-label="staging", the use of --environment="production" will mean that the host is not considered to be in production mode; but if you set --production-environment-label="honey" and --environment="honey", then the application runs in production mode, tagged as honey.
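The rule described above boils down to a single comparison. This is a sketch under that reading, not the Apario Reader source; the function name isProductionMode is made up for the example.

```go
package main

import "fmt"

// isProductionMode reports whether the instance should run in production
// mode: the environment must equal the production label, and the label
// falls back to "production" when it is left empty.
// Hypothetical helper for illustration only.
func isProductionMode(environment, productionLabel string) bool {
	if productionLabel == "" {
		productionLabel = "production"
	}
	return environment == productionLabel
}

func main() {
	fmt.Println(isProductionMode("production", ""))        // default label
	fmt.Println(isProductionMode("production", "staging")) // label overridden
	fmt.Println(isProductionMode("honey", "honey"))        // custom label matches
}
```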

--info-log or INFO_LOG=

Default: <nothing>

When used, this is the path to the primary application's log - not the web traffic log. This file should be subject to log rotation as it can become a single point of failure on an instance of apario-reader that is left running for months at a time unattended.

--config or CONFIG=

Default: config.yaml

When used, this is the filename of the *.yaml config file that the application depends on for its runtime information when the full set of command line arguments are not used in conjunction with the apario-reader binary.

--enable-ping or ENABLE_PING=

Default: false

When set to true, a new /ping route is exposed that responds with PONG to any HTTP verb (GET, POST, PUT, PATCH, DELETE, and so on).

--site-title or SITE_TITLE=

Default: idoread.com

This is the name of the website that will appear all throughout the interface of Apario Reader.

--site-company or SITE_COMPANY=

Default: Project Apario LLC

This is the name of the organization or company that is operating the instance of Apario Reader for the world. Put your name or group here. If you're a podcast making an instance of apario-reader for your community, enter your podcast name here.

--primary-domain or PRIMARY_DOMAIN=

Default: <undefined>

When used, this will assign the domain name of the instance to whatever value you place here. When configuring HTTPS, you'll need to ensure that the Certificate is issued for this domain name.

--decimal-symbol or DECIMAL_SYMBOL=

Default: .

When defined, this allows you to override the locale settings of the instance of apario-reader to replace . decimal places for numbers with whatever value you place here as --decimal-symbol="," so that $10.99 becomes $10,99 when rendered.
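As a rough illustration of the effect, the sketch below swaps the decimal point for the configured symbol after formatting a number. The formatAmount function is hypothetical and is not taken from the Apario Reader source.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatAmount renders value with two decimal places and replaces the "."
// with decimalSymbol, defaulting to "." when no symbol is given.
// Hypothetical helper for illustration only.
func formatAmount(value float64, decimalSymbol string) string {
	if decimalSymbol == "" {
		decimalSymbol = "."
	}
	formatted := strconv.FormatFloat(value, 'f', 2, 64)
	return strings.Replace(formatted, ".", decimalSymbol, 1)
}

func main() {
	fmt.Println("$" + formatAmount(10.99, "."))
	fmt.Println("$" + formatAmount(10.99, ","))
}
```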

Database Settings

--database or DATABASE=

Default: <undefined>

This is the path to your apario-writer output destination. This should be a simple directory like /apario/database where the subdirectories inside are named after the SHA512 checksum of the source URL of each document. It is recommended to have this on its own drive, separate from the drive that the logs are written to. This drive will most likely be very large, since it serves the content you require. Keep in mind the speed of your disks too. For SAS drives connected over the network, you'll consistently see 500MB/s throughput; for SSD drives you'll see closer to 600MB/s, and for NVMe drives closer to 1700MB/s. For 5400 RPM HDD drives (non-enterprise grade), you'll see 120MB/s throughput and a life expectancy about half of what you'd experience with an enterprise grade HDD. For a 7200 RPM enterprise grade HDD, you'll see performance around 220MB/s throughput. Given that thumbnails are 1/8 MB each, 220MB/s can serve up to 1760 thumbnails/second. Given that full size pages served through StumbleInto are around 7/8 MB each, that same drive can serve about 250 StumbleInto/second. For NVMe drives, you're looking at 13,600 thumbnails/second or 1,942 StumbleInto/second.
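The thumbnail and StumbleInto figures above are just throughput divided by item size. If you want to plug in the numbers for your own drives, a small sketch like this does the arithmetic; the function name and the drive list are only for illustration.

```go
package main

import "fmt"

// itemsPerSecond estimates how many items of itemSizeMB a drive sustaining
// throughputMBps can serve each second, rounded down.
func itemsPerSecond(throughputMBps, itemSizeMB float64) int {
	return int(throughputMBps / itemSizeMB)
}

func main() {
	const thumbnailMB, pageMB = 1.0 / 8, 7.0 / 8 // sizes quoted in this guide

	drives := []struct {
		name string
		mbps float64
	}{
		{"5400 RPM HDD", 120},
		{"7200 RPM enterprise HDD", 220},
		{"SAS over network", 500},
		{"SSD", 600},
		{"NVMe", 1700},
	}
	for _, d := range drives {
		fmt.Printf("%-24s %6d thumbnails/s %6d pages/s\n",
			d.name, itemsPerSecond(d.mbps, thumbnailMB), itemsPerSecond(d.mbps, pageMB))
	}
}
```

Real-world numbers will be lower once you account for seek times, file system overhead, and concurrent users, so treat these as upper bounds.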

--persist-runtime-database or PERSIST_RUNTIME_DATABASE=

Default: true

When true, this will create an app.db directory of .json files that contain the in-memory data required for full-text search. With this set to true the index is still built during boot and then saved to disk, but the saved database is not loaded by this flag. Saving the index database to disk is what allows it to be loaded using the next configurable, which is where the boot time savings come from.

--load-persistent-database or LOAD_PERSISTENT_DATABASE=

Default: true

When true, this will read the app.db if it's populated and load it into memory. When combined with --persist-runtime-database, this speeds up boot time of the application considerably.

--flush-database-watch-file or FLUSH_DATABASE_WATCH_FILE=

Default: <undefined>

When defined, the application will watch the path specified in this configurable. When this file is created on the system, the goroutine watching this path will trigger the database to flush itself and rebuild its index. This is a resource intensive task that will place users coming into your site in a waiting room until the migration has completed.

--users-database-path or USERS_DATABASE_PATH=

Default: <undefined>

When defined, a directory called users.db is created that stores *.json files for all user accounts on your instance. Their cached Apario Identity public profile will be here. All users who connect to your instance are here.

--snippets-database-path or SNIPPETS_DATABASE_PATH=

Default: <undefined>

When defined, a directory called snippets.db is created that stores *.json files alongside rendered assets for all snippets created on the instance. Snippets metadata is submitted to SnipThat.info but snippet data itself is stored on the instance of apario-reader hosting that OSINT.

--tag-database-path or TAG_DATABASE_PATH=

Default: <undefined>

When defined, a directory called tags.db is created that stores *.json files for tags. Each tag is a file. Each tag contains metadata and properties that are visible throughout the apario-reader application. The tag database path is where Gematria is stored for tags and where tag-relationships are established. A tag is not just a single word; it's a word and a type combined together to mean something entirely new. Therefore, tags have relationships to other tags, and tags are one way to make content discoverable. Tagging a document requires Reputation. Tagged documents are stored in this database directory.

--database-concurrent-write-semaphore or DATABASE_CONCURRENT_WRITE_SEMAPHORE=

Default: 1

When defined, a semaphore will be used to limit the number of concurrent write operations permitted on the apario-writer database database.db. This isn't the app.db or users.db or tags.db or snippets.db; it's the OSINT database itself. This semaphore controls how documents are changed through the Proposal system and what level of control you want to have over your instance. For instances that I maintain, I set this to 1 so I know for sure that nothing is changing the database en-masse.
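If semaphores are new to you, the usual way to build one in Go is a buffered channel whose capacity is the limit. The sketch below simulates a batch of writers and measures how many were ever writing at the same time. It illustrates the general technique only and is not the Apario Reader code; peakConcurrency is a made-up name, and the limit must be at least 1.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// peakConcurrency runs `writers` simulated writes behind a semaphore of
// size `limit` and reports the highest number in flight at the same time.
func peakConcurrency(limit, writers int) int {
	sem := make(chan struct{}, limit) // capacity == allowed concurrent writes
	var (
		mu             sync.Mutex
		inFlight, peak int
		wg             sync.WaitGroup
	)
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire: blocks while the limit is reached
			defer func() { <-sem }() // release when the write is done

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			time.Sleep(time.Millisecond) // stand-in for the actual write

			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak writers with limit 1:", peakConcurrency(1, 5))
	fmt.Println("peak writers with limit 3:", peakConcurrency(3, 10))
}
```

With the limit at 1, writes are strictly one at a time, which is why that default gives you the tightest control over changes.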

--textee-database or TEXTEE_DATABASE=

Default: <undefined>

When defined, a directory called textee.db will be created that will keep track of all substring v. gematria index results that are used for partial full-text search.

Appliance Settings

These settings change the "appliance" state of the machine running apario-reader. Since the application runs on bare metal hardware and is designed to use 100% of the system it has available to it, it's called an appliance and not an application. For that reason, these settings have a broader impact on the host machine running apario-reader than the clients connecting to your instance reading OSINT.

Encryption Settings

--tls-public-key or TLS_PUBLIC_KEY=

Default: ""

Expected to be a path to a PEM formatted SSL certificate and CA bundle, the TLS Public Key configurable is a string type variable. The application needs read permissions on the path, and the file must contain the Certificate + CA Bundle concatenated together.

--tls-private-key or TLS_PRIVATE_KEY=

Default: ""

Expected to be a path to a PEM formatted SSL private key, the TLS Private Key configurable is a string type variable. The application needs read permissions on the path, and the file must contain the private key that matches the --tls-public-key configurable.

--tls-private-key-password or TLS_PRIVATE_KEY_PASSWORD=

Default: ""

The decryption password for the TLS private key (if applicable).

--auto-tls or AUTO_TLS=

Default: false

When set to true, apario-reader will issue a self-signed SSL certificate to bind your instance to HTTPS connections.

--tls-life-min or TLS_LIFE_MIN=

Default: ""

When defined, a duration in minutes set as an integer defines the life of the TLS certificate validity in memory before being re-checked.

--tls-expires-in or TLS_EXPIRES_IN=

Default: ""

When defined, this will be the expiration of the TLS certificate that is issued in a self-signed manner.

--tls-company or TLS_COMPANY=

Default: ""

When defined, this is the name that appears in the SSL Certificate details.

--tls-san-ip or TLS_SAN_IP=

Default: ""

When defined, the SSL Certificate generated for the self-signed HTTPS experience on apario-reader can have SAN IP addresses attached to it, so the certificate can be used elsewhere throughout your services. IP bound SSL certificates offer more security. When using the --auto-tls functionality within apario-reader, set --tls-san-ip to the IP address of the host that is running the instance (you can run curl -L https://ip4.washere.dev to get the IPv4 address of your WAN). Defining this IP address here allows the bind(":443") to attach only to the IP of your apario-reader instance. This protects your users and yourself from potential issues caused by using self signed certificates.

--tls-additional-domains or TLS_ADDITIONAL_DOMAINS=

Default: ""

When defined, additional domain names will be secured using --auto-tls. For example, when you have a mydomain1.com DNS A record pointing to apario-reader and another mydomain2.com DNS A record pointing to the same instance, a multi-domain SSL certificate allows your single apario-reader process to serve your OSINT content over HTTPS on both mydomain1.com and mydomain2.com. Setting these additional domains allows you to use a self-signed certificate that names each domain you wish to use. If your users are okay with accepting the risks associated with self-signed certificates, you can secure your instance of Apario Reader with multiple IP addresses and multiple domains, and do it in a way where you control the expiration.

--session-secret or SESSION_SECRET=

Default: ""

When defined, sessions will be encrypted using this secret; otherwise a random string will be used, and this can cause issues in production between deployments. Setting this value ensures that connected sessions retain their data between deployments of apario-reader, because they can still decrypt their active session across apario-reader process restarts.

--session-authenticity-token-secret or SESSION_AUTHENTICITY_TOKEN_SECRET=

Default: ""

When defined, the CSRF authenticity token is cryptographically generated using this secret. Thus, when you deploy instances of apario-reader in the wild and are upgrading your instance, the CSRF token will remain valid between instance restarts; otherwise, this value assumes that functionality.
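Why does a fixed secret survive restarts? A common way to build such tokens is an HMAC over the session identifier, so any process that knows the same secret can regenerate and verify the token. The sketch below shows that general technique with HMAC-SHA256. Whether apario-reader uses this exact construction is an assumption, and authenticityToken and validToken are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// authenticityToken derives a token for sessionID from the shared secret.
func authenticityToken(secret, sessionID string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(sessionID))
	return hex.EncodeToString(mac.Sum(nil))
}

// validToken checks a token in constant time against the expected value.
func validToken(secret, sessionID, token string) bool {
	expected := authenticityToken(secret, sessionID)
	return hmac.Equal([]byte(expected), []byte(token))
}

func main() {
	token := authenticityToken("my-fixed-secret", "session-123")

	// After a restart with the same secret, the old token still verifies.
	fmt.Println("same secret:", validToken("my-fixed-secret", "session-123", token))

	// After a restart with a new random secret, the old token is rejected.
	fmt.Println("new secret: ", validToken("random-after-restart", "session-123", token))
}
```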

Advanced Search Settings

--concurrent-searches or CONCURRENT_SEARCHES=

Default: 30

Advanced Search is built into all instances of apario-reader, and searches are considered resource-intensive tasks. As such, you need to define a limit on the number of searches that apario-reader will perform concurrently. When this limit is reached, the user's connection to the page hangs until space opens up for their search. When new users enter the full search, they are presented with the waiting room offering them a place in line.

--search-algorithm or SEARCH_ALGORITHM=

Default: jaro_winkler

I won't search for things you can search for yourself, but Jaro Winkler is the algorithm that is enabled by default to perform searches. The available algorithms include jaro, jaro_winkler, wagner_fisher, ukkonen and hamming. Each of these algorithms will slightly tweak the results of Advanced Search, since they are the means that apario-reader uses to correct for the 71% accuracy of the OSINT OCR (optical character recognition) output from Tesseract. With these algorithms enabled, a search for kenedy will match kennedy, thus allowing for OCR errors and user errors with a 20% grace given to you.

--search-concurrent-buffer or SEARCH_CONCURRENT_BUFFER=

Default: 369

These are concurrency channels used for processing search results. Tune this if your server is slow or you want to increase throughput since you're not seeing much utilization on your box despite users on it.

--search-concurrency-limiter or SEARCH_CONCURRENCY_LIMITER=

Default: 9

This is also used for concurrency channels to process search results. Again, tuning this number helps you. This number isn't for the buffer of results; it's for the concurrency limiter of the results processor.

--search-timeout-seconds or SEARCH_TIMEOUT_SECONDS=

Default: 30 (seconds)

How many seconds Advanced Search is allowed to run before the request is terminated and the results found so far are delivered as-is to the user.

Search Algorithm: Jaro

--search-threshold-jaro or SEARCH_THRESHOLD_JARO=

Default: 0.71

This threshold value is used to adjust the Jaro algorithm only. A 0.71 value represents a comfortable balance for the % of characters that must be valid to match. Too high of a value lowers the grace allowed for OCR errors and user errors/typos in the search keywords. Too low of a value and you see matches for apple and car because they both contain an a in them.
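To see what the threshold is being compared against, here is a compact sketch of the textbook Jaro similarity. It is not the implementation that ships with apario-reader, which may differ in details, but it shows why kenedy clears 0.71 against kennedy while apple and car do not.

```go
package main

import "fmt"

// jaro returns the textbook Jaro similarity of a and b, between 0 and 1.
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 && len(rb) == 0 {
		return 1
	}
	if len(ra) == 0 || len(rb) == 0 {
		return 0
	}
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1 // how far apart matching characters may be
	if window < 0 {
		window = 0
	}
	matchedA, matchedB := make([]bool, len(ra)), make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched characters that appear in a different order.
	outOfOrder, k := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[k] {
			k++
		}
		if ra[i] != rb[k] {
			outOfOrder++
		}
		k++
	}
	m := float64(matches)
	transpositions := float64(outOfOrder) / 2
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-transpositions)/m) / 3
}

func main() {
	const threshold = 0.71 // default for --search-threshold-jaro
	pairs := [][2]string{{"kenedy", "kennedy"}, {"apple", "car"}}
	for _, p := range pairs {
		score := jaro(p[0], p[1])
		fmt.Printf("%s vs %s: %.4f match=%v\n", p[0], p[1], score, score >= threshold)
	}
}
```

Under this sketch, apple and car score a little above 0.5, so dropping the threshold to 0.5 would let them match, which is the kind of noise the warning above is about.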

Search Algorithm: Jaro Winkler

--search-threshold-jaro-winkler or SEARCH_THRESHOLD_JARO_WINKLER=

Default: 0.71

--search-jaro-winkler-boost-threshold or SEARCH_JARO_WINKLER_BOOST_THRESHOLD=

Default: 0.7

--search-jaro-winkler-prefix-size or SEARCH_JARO_WINKLER_PREFIX_SIZE=

Default: 3
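These three settings line up with the standard Winkler modification of Jaro: when the plain Jaro score is above the boost threshold, the score is nudged upward for each leading character the two words share, up to the prefix size. Below is a sketch of that adjustment. The 0.1 scaling factor is the conventional value and is an assumption here, as are the function names; the apario-reader implementation may differ.

```go
package main

import "fmt"

// commonPrefix counts the leading runes a and b share, capped at limit.
func commonPrefix(a, b string, limit int) int {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < limit && n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return n
}

// winklerBoost applies the Winkler prefix bonus to an existing Jaro score.
// Scores at or below boostThreshold are returned unchanged.
func winklerBoost(jaroScore float64, a, b string, boostThreshold float64, prefixSize int) float64 {
	if jaroScore <= boostThreshold {
		return jaroScore
	}
	const scaling = 0.1 // conventional Winkler scaling factor (assumed)
	prefix := float64(commonPrefix(a, b, prefixSize))
	return jaroScore + prefix*scaling*(1-jaroScore)
}

func main() {
	// 20/21 is the plain Jaro score for kenedy vs kennedy.
	plain := 20.0 / 21.0
	boosted := winklerBoost(plain, "kenedy", "kennedy", 0.7, 3)
	fmt.Printf("plain=%.4f boosted=%.4f\n", plain, boosted)
}
```

The practical effect is that words starting the same way rank a little higher, which suits names where the beginning is usually read correctly.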

The remaining configurables cover three more search algorithms: Ukkonen, Wagner Fischer, and Hamming. Each algorithm provides options to adjust operation costs and constraints, so you can tune it for performance or accuracy depending on your use case. For both Ukkonen and Wagner Fischer, you can configure the costs of insertion, substitution, and deletion operations (defaulting to 1 for insertion and deletion, and 2 for substitution), alongside a limit on the maximum number of allowable substitutions, which defaults to 2. The Hamming algorithm is simpler, offering only a configuration for the maximum number of substitutions, also defaulting to 2.

Search Algorithm: Ukkonen

--search-ukkonen-icost or SEARCH_UKKONEN_ICOST=

Default: 1

--search-ukkonen-scost or SEARCH_UKKONEN_SCOST=

Default: 2

--search-ukkonen-dcost or SEARCH_UKKONEN_DCOST=

Default: 1

--search-ukkonen-max-substitutions or SEARCH_UKKONEN_MAX_SUBSTITUTIONS=

Default: 2

Search Algorithm: Wagner Fischer

--search-wagner-fischer-icost or SEARCH_WAGNER_FISCHER_ICOST=

Default: 1

--search-wagner-fischer-scost or SEARCH_WAGNER_FISCHER_SCOST=

Default: 2

--search-wagner-fischer-dcost or SEARCH_WAGNER_FISCHER_DCOST=

Default: 1

--search-wagner-fischer-max-substitutions or SEARCH_WAGNER_FISCHER_MAX_SUBSTITUTIONS=

Default: 2
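To illustrate what the insertion, substitution, and deletion costs do, here is a sketch of the Wagner Fischer dynamic programming algorithm with configurable costs. It uses the defaults from this guide (1, 2, 1) and does not model the max-substitutions limit, since how apario-reader applies that limit is not described here. The function names are made up for the example.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// weightedDistance computes the cheapest way to turn a into b using
// insertions (icost), substitutions (scost), and deletions (dcost).
func weightedDistance(a, b string, icost, scost, dcost int) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j * icost // building b's prefix from nothing
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i * dcost // deleting all of a's prefix
		for j := 1; j <= len(rb); j++ {
			substitute := prev[j-1]
			if ra[i-1] != rb[j-1] {
				substitute += scost
			}
			curr[j] = min3(substitute, prev[j]+dcost, curr[j-1]+icost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println("kenedy -> kennedy:", weightedDistance("kenedy", "kennedy", 1, 2, 1))
	fmt.Println("kennedy -> kenedy:", weightedDistance("kennedy", "kenedy", 1, 2, 1))
	fmt.Println("cat -> cut:", weightedDistance("cat", "cut", 1, 2, 1))
}
```

With substitution priced at 2, swapping one letter costs the same as deleting it and inserting the right one, so raising or lowering scost changes how harshly misread characters are penalized compared to missing ones.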

Search Algorithm: Hamming

--search-hamming-max-substitutions or SEARCH_HAMMING_MAX_SUBSTITUTIONS=

Default: 2
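Hamming distance simply counts the positions where two equal-length words differ, so a maximum of 2 substitutions means up to two wrong characters are tolerated. Here is a small sketch of that check; treating words of different lengths as non-matches is an assumption of this example, as are the function names.

```go
package main

import "fmt"

// hamming counts differing positions between a and b. The second return
// value is false when the lengths differ, since Hamming distance is only
// defined for equal-length inputs.
func hamming(a, b string) (int, bool) {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return 0, false
	}
	distance := 0
	for i := range ra {
		if ra[i] != rb[i] {
			distance++
		}
	}
	return distance, true
}

// hammingMatch reports whether a and b differ in at most maxSubstitutions
// positions.
func hammingMatch(a, b string, maxSubstitutions int) bool {
	distance, ok := hamming(a, b)
	return ok && distance <= maxSubstitutions
}

func main() {
	const maxSubstitutions = 2 // default for --search-hamming-max-substitutions
	fmt.Println(hammingMatch("kennedy", "kennefy", maxSubstitutions)) // one bad character
	fmt.Println(hammingMatch("kennedy", "cannery", maxSubstitutions)) // three bad characters
	fmt.Println(hammingMatch("kennedy", "kenedy", maxSubstitutions))  // different lengths
}
```

Notice that kenedy does not match kennedy here, because a dropped letter changes the length. That is the trade-off for Hamming being the simplest of the algorithms.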

Application Settings

--access-log or ACCESS_LOG=

Default: ""

When defined, this is an absolute path to the log file that will contain the Gin web server access log. This file should have log rotation enabled on it.

--gin-log-stdout or GIN_LOG_STDOUT=

Default: true

When true, Gin, the web server package used in apario-reader, will send its logs to /dev/stdout instead of a file path defined by --access-log.

--unsecure-port or UNSECURE_PORT=

Default: 8080

The port to bind HTTP connections to.

--secure-port or SECURE_PORT=

Default: 8443

The port to bind HTTPS connections to.

--force-https or FORCE_HTTPS=

Default: false

When set to true, connections coming into http://yourdomain.com will be upgraded to https://yourdomain.com automatically.
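Here is a sketch of what such an upgrade looks like with the Go standard library: the HTTP listener answers every request with a redirect to the same host and path over HTTPS. Apario Reader uses Gin, so this is not its actual code. The httpsTarget and redirectToHTTPS names, the permanent redirect status, and the handling of a non-443 secure port are assumptions of the example.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strconv"
)

// httpsTarget builds the HTTPS URL for a request that arrived over HTTP.
// Any port on the incoming host is dropped, and securePort is appended
// unless it is the default HTTPS port 443.
func httpsTarget(host, requestURI string, securePort int) string {
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	if securePort != 443 {
		host = net.JoinHostPort(host, strconv.Itoa(securePort))
	}
	return "https://" + host + requestURI
}

// redirectToHTTPS returns a handler for the HTTP listener that sends every
// request to its HTTPS equivalent.
func redirectToHTTPS(securePort int) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target := httpsTarget(r.Host, r.URL.RequestURI(), securePort)
		http.Redirect(w, r, target, http.StatusMovedPermanently)
	})
}

func main() {
	fmt.Println(httpsTarget("yourdomain.com", "/", 443))
	fmt.Println(httpsTarget("yourdomain.com:8080", "/search?q=kennedy", 8443))
	_ = redirectToHTTPS(8443) // would be passed to the HTTP listener
}
```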