Collecting data
Now that you’ve installed Augur’s application server, it’s time to configure your data collection workers. If you just want to run Augur using the one repository in the default database and the default worker settings, all you need to do is start the Redis server in one terminal, make sure RabbitMQ is running, start a Celery worker in a second terminal, and start the Augur application in a third. (Don’t forget that the AUGUR_DB environment variable needs to be set in the terminal, or set permanently.)
# Terminal Window 1
# Starts the redis server
redis-server
# Terminal Window 2
# Start celery worker so it can accept tasks
celery -A augur.tasks.init.celery_app.celery_app worker --loglevel=info
# Terminal Window 3
# To Start Augur:
(nohup augur backend start)
# To Stop Augur:
augur backend stop
augur backend kill
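Because Augur depends on the AUGUR_DB environment variable, a quick check before starting the backend can catch a missing value early. The following is a minimal sketch in Python; the helper name augur_db_is_set is hypothetical and not part of Augur, and it only confirms that the variable is non-empty, not that its value is valid.

```python
import os


def augur_db_is_set(environ=None):
    """Return True if AUGUR_DB is present and non-empty in the environment."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if environ is None else environ
    return bool(env.get("AUGUR_DB", "").strip())


if augur_db_is_set():
    print("AUGUR_DB is set")
else:
    print("AUGUR_DB is not set; export it before starting Augur")
```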
Now, here’s a ton of brain-splitting detail about workers, and their configuration. There are 2 pieces to data collection with Augur: the housekeeper, and the data collection workers. The housekeeper creates long-running “jobs” that specify what kind of data to collect for what set of repositories. The data collection workers can then accept these jobs, after which they will use the information provided in the job to find the repositories in question and collect the requested data.
Since the default housekeeper setup will work for most use cases, we’ll first cover how to configure the workers and then briefly touch on the housekeeper configuration options, after which we’ll cover how to add repos and repo groups to the database.
Configuring the Workers
There are a few workers that ship ready to collect out of the box:
facade_worker (collects raw commit and contributor data by parsing Git logs)
github_worker (collects issue data from the GitHub API)
contributor_worker (collects contributor data from the GitHub API)
pull_request_worker (collects pull request data from the GitHub API)
repo_info_worker (collects repository statistics from the GitHub API)
release_worker (collects release data from the GitHub API)
linux_badge_worker (collects CII badging data from the CII API)
insight_worker (queries Augur’s metrics API to find interesting anomalies in the collected data)
All worker configuration options are found in the Workers block of the augur.config.json file (which was generated for you at the end of the previous section). This file is located at $HOME/.augur/augur.config.json. Each worker has its own subsection with the same title as the worker’s name. We recommend leaving the defaults and only changing them when explicitly necessary, as the default parameters will work for most use cases. Read on for more on how to make sure your workers are properly configured.
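To see what is currently configured, the Workers block can be inspected with a few lines of Python. This is a rough sketch, assuming the file is plain JSON with a top-level Workers key as described above. The helper names and the sample values are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path

# Location stated in the text above.
CONFIG_PATH = Path.home() / ".augur" / "augur.config.json"


def load_config(path=CONFIG_PATH):
    """Parse the JSON config file at the given path."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def worker_sections(config):
    """Return the Workers block of a parsed config as {worker_name: options}."""
    return config.get("Workers", {})


# Illustrative stand-in for a parsed config file; the values are made up.
sample_config = {
    "Workers": {
        "facade_worker": {"switch": 1, "workers": 1, "port": 50100},
        "github_worker": {"switch": 0, "workers": 1, "port": 50200},
    }
}

for name, options in worker_sections(sample_config).items():
    print(name, options)
```

To inspect a real installation, pass the result of load_config() to worker_sections() instead of the sample dictionary.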
Standard configuration options
Each worker has 3 configuration options that are standard across all workers. The worker-specific options are detailed in the sections following this one.
The standard options are:
switch, a boolean flag indicating if the worker should automatically be started with Augur. Defaults to 0 (false).
workers, the number of instances of this worker that Augur should spawn if switch is set to 1. Defaults to 1 for all workers except the value_worker and insight_worker.
port, which is the base TCP port the worker will use to communicate with Augur’s broker. The default is different for each worker, but the lowest is 50100 and each worker increments the default starting port by 100. If the workers parameter is > 1, then workers will bind to port + i for the i’th worker spawned.
Keeping workers at 1 should be fine for small collection sets, but if you have a lot of repositories to collect data for, you can raise it. We also suggest double-checking that the default worker ports are free on your machine.
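The port arithmetic and the "are the ports free" check can be sketched in Python as follows. This assumes that i starts at 0 for the first instance, which the text does not state, and that workers bind on the local machine. Both function names are hypothetical and not part of Augur.

```python
import socket


def worker_ports(base_port, workers):
    """Ports for each spawned instance: base_port + i for i = 0..workers-1.

    Assumes i is zero-based; adjust if your Augur version counts from 1.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [base_port + i for i in range(workers)]


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


# Check the ports three instances would use with a base port of 50100.
for port in worker_ports(50100, 3):
    print(port, "free" if port_is_free(port) else "in use")
```

Note that a port reported as free can still be taken by another process before Augur starts, so this is only a sanity check.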
Worker-specific configuration options
Next up are the configuration options specific to each worker (but some workers require no additional configuration beyond the defaults). The most pertinent of these options is the facade_worker's repo_directory, so make sure to pay attention to that one.
facade_worker
repo_directory, which is the local directory where the facade_worker will clone the repositories it needs to analyze. You should have been prompted for this during installation, but if you need to change it, make sure that it’s an absolute path (environment variables like $HOME are not supported) and that the directory already exists. Defaults to repos/, but it’s highly recommended you change this.
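The three requirements above (absolute path, no environment variables, directory already exists) are easy to check before editing the config. Here is a minimal sketch in Python; the function repo_directory_problems is a hypothetical helper, not something Augur provides.

```python
import os
import tempfile


def repo_directory_problems(path):
    """Return a list of problems with a candidate repo_directory value."""
    problems = []
    # Variables such as $HOME are not expanded, so flag them.
    if "$" in path:
        problems.append("environment variables are not supported")
    if not os.path.isabs(path):
        problems.append("path is not absolute")
    elif not os.path.isdir(path):
        problems.append("directory does not exist")
    return problems


# An existing absolute directory passes; the relative default does not.
with tempfile.TemporaryDirectory() as existing_dir:
    print(repo_directory_problems(existing_dir))
print(repo_directory_problems("repos/"))
```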
insight_worker
We recommend leaving the defaults in place for the insight worker unless you are interested in other metrics, or anomalies for a different time period.
training_days, which specifies the date range that the insight_worker should use as its baseline for the statistical comparison. Defaults to 365, meaning that the worker will identify metrics that have had anomalies compared to their values over the course of the past year, starting at the current date.
anomaly_days, which specifies the date range in which the insight_worker should look for anomalies. Defaults to 2, meaning that the worker will detect anomalies that have only occurred within the past two days, starting at the current date.
contamination, which is the “sensitivity” parameter for detecting anomalies. Acts as an estimated percentage of the training_days that are expected to be anomalous. The default is 0.041 for the default training days of 365: 4.1% of 365 days means that about 15 data points of the 365 days are expected to be anomalous.
metrics, which specifies which metrics the insight_worker should run the anomaly detection algorithm on. This is structured like so:
{
    'endpoint_name_1': 'field_1_of_endpoint',
    'endpoint_name_1': 'field_2_of_endpoint',
    'endpoint_name_2': 'field_1_of_endpoint',
    ...
}
# defaults to the following
{
    "issues-new": "issues",
    "code-changes": "commit_count",
    "code-changes-lines": "added",
    "reviews": "pull_requests",
    "contributors-new": "new_contributors"
}
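To make the relationship between training_days, anomaly_days, and contamination concrete, the arithmetic can be sketched in Python. This is only an illustration of the numbers quoted above. The function names are hypothetical, and the exact window boundaries the insight_worker uses (for example, whether the current day is included) are an assumption.

```python
from datetime import date, timedelta


def expected_anomalies(training_days, contamination):
    """Approximate number of training data points expected to be anomalous."""
    return training_days * contamination


def insight_windows(today, training_days=365, anomaly_days=2):
    """Return (training_start, anomaly_start), both counted back from today."""
    return (
        today - timedelta(days=training_days),
        today - timedelta(days=anomaly_days),
    )


# With the defaults, 0.041 * 365 = 14.965, i.e. about 15 data points.
print(round(expected_anomalies(365, 0.041)))

training_start, anomaly_start = insight_windows(date(2024, 1, 10))
print(training_start, anomaly_start)
```

If you change training_days, it may make sense to revisit contamination as well, since the default of 0.041 is quoted for a 365-day baseline.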
value_worker
scc_bin, the command that the value_worker should use to invoke scc. If installed with go get github.com/boyter/scc, then the default of scc should probably work, but double check for your particular Go installation.
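One way to double check is to ask Python whether the command resolves on your PATH. A minimal sketch, using the standard library's shutil.which; the wrapper name resolve_scc is hypothetical.

```python
import shutil


def resolve_scc(scc_bin="scc"):
    """Return the full path of the scc executable, or None if not on PATH."""
    return shutil.which(scc_bin)


location = resolve_scc()
if location:
    print("scc found at", location)
else:
    print("scc not found on PATH")
```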
Housekeeper
We strongly recommend leaving the default housekeeper blocks generated by the installation process, but if you would like to know more, or fine-tune them to your needs, read on.
The housekeeper is responsible for generating the tasks that will tell the workers what data to collect, and how. Housekeeper configuration options are found in the Housekeeper block of the config file. The Housekeeper block has a single key, jobs, which is an array of tasks the housekeeper should create. Each task has the following structure:
{
    "delay": <int>,
    "given": [
        "<string>"
    ],
    "model": "<string>",
    "repo_group_id": <int>,
    ... //other task-specific parameters
}
The delay parameter is the amount of time the housekeeper should wait before scheduling a new update task.
The given parameter is used in conjunction with the model parameter to determine which workers can accept a data collection task. Each worker can collect data if it is “given” data in a certain format, for example, a github_url (in the case of the github_worker and pull_request_worker) or perhaps just any valid git_url (as in the case of the facade_worker).
The model parameter is the other parameter used to determine which workers can accept a given task. It represents the part of the conceptual data model that the worker can fulfill; for example, the facade_worker fills out the commits model since it primarily gathers data about commits, and the github_worker fills out both the issues and contributors models.
The repo_group_id parameter specifies which group of repos the housekeeper should collect data for; use the default of 0 to specify ALL repo groups in the database.
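The way given and model together select a worker can be illustrated with a small Python sketch. This is not Augur's actual matching code. The capability table is a hypothetical structure built only from the two examples in the text, and the delay value in the sample job is made up.

```python
# What each worker accepts, taken from the examples in the text above.
# The shape of this table is an assumption made for illustration.
WORKER_CAPABILITIES = {
    "facade_worker": {"given": {"git_url"}, "models": {"commits"}},
    "github_worker": {"given": {"github_url"}, "models": {"issues", "contributors"}},
}


def workers_for_job(job, capabilities=WORKER_CAPABILITIES):
    """Return names of workers whose accepted input and models match the job."""
    return sorted(
        name
        for name, caps in capabilities.items()
        if job["model"] in caps["models"] and set(job["given"]) & caps["given"]
    )


# Example job in the shape shown above; the delay value is made up.
job = {"delay": 1000, "given": ["github_url"], "model": "issues", "repo_group_id": 0}
print(workers_for_job(job))
```

A job matches a worker only when both conditions hold, so a job that gives a git_url but asks for the issues model would match neither worker in this table.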
Adding repos for collection
If you’re using the Docker container, you can use the provided UI to load your repositories. Otherwise, you’ll need to use the Augur CLI to load your repositories. Please reference the respective sections of the documentation for detailed instructions on how to accomplish both of these steps.
Running collections
Congratulations! At this point you (hopefully) have a fully functioning and configured Augur instance.
After you’ve loaded your repos, you’re ready for your first collection run. We recommend running only the default workers first to gather the initial data. If you’re collecting data for a lot of repositories, or repositories with a lot of data, we recommend increasing the number of github_workers and pull_request_workers.
You can now run Augur and start the data collection by issuing the augur backend start command in the root augur directory. All your logs (including worker logs and error files) will be saved to a logs/ subdirectory in that same folder, but this can be customized - more on that and other logging utilities in the development guide.
Once you’ve finished the initial data collection, we suggest then running the value_worker (if you have it installed) and the insight_worker. This is because the value_worker depends on the source files of the repositories cloned by the facade_worker, and the insight_worker uses the data from all the other workers to identify anomalies by performing statistical analysis on the data returned from Augur’s metrics API.
You’re now ready to start exploring the data Augur can gather and metrics we can generate. If you’re interested in contributing to Augur’s codebase, you can check out the development guide. For information about Augur’s frontend, keep reading!
Happy collecting!