Collecting data
Now that you’ve installed Augur’s application server, it’s time to configure data collection if needed. If you just want to run Augur with the default repositories, the default database, and the default celery collection settings, all you need to do is start the redis server in one terminal, make sure rabbitmq is running, and start the augur application in a second terminal. (Don’t forget that the AUGUR_DB environment variable needs to be set in that terminal, or set permanently.)
# Terminal Window 1
# Starts the redis server
redis-server
# Terminal Window 2
# To Start Augur:
(nohup augur backend start)
# To Stop Augur:
augur backend stop
augur backend kill
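Before starting Augur, it can save a failed run to confirm that both services are actually up. A minimal check, assuming redis-cli is on your PATH and rabbitmq was installed as a systemd service (adjust for your platform):
# Redis answers "PONG" when it is reachable
redis-cli ping
# RabbitMQ status on systemd-based distros
sudo systemctl status rabbitmq-server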
Now, here’s a ton of brain-splitting detail about celery collection. There are two pieces to data collection with Augur: the celery worker processes, and the job messages passed through rabbitmq. The jobs to collect are determined by a monitor process, started through the same cli that starts the rest of augur. The monitor process generates job messages to send to rabbitmq, using the collection_status table to track which jobs have yet to be run. The celery collection workers then accept these jobs, and use the information provided in each job to find the repositories in question and collect the requested data.
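If you want to watch this machinery at work, the collection_status table lives in the augur_operations schema. The exact columns vary between Augur versions, so the sketch below describes the table before selecting from it; it also assumes AUGUR_DB holds a plain postgresql:// connection string (if yours uses a SQLAlchemy-style prefix such as postgresql+psycopg2://, drop the +psycopg2 part for psql):
# Describe the table, then peek at a few rows
psql "$AUGUR_DB" -c '\d augur_operations.collection_status'
psql "$AUGUR_DB" -c 'SELECT * FROM augur_operations.collection_status LIMIT 5;'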
Since the default setup will work for most use cases, we’ll first cover how to configure some specific data collection jobs and then briefly touch on the celery configuration options, after which we’ll cover how to add repos and repo groups to the database.
Configuring Collection
There are many collection jobs that ship ready to collect out of the box:
- augur.tasks.git.facade_tasks (collects raw commit and contributor data by parsing Git logs)
- augur.tasks.github (parent module of all GitHub-specific collection jobs)
- augur.tasks.github.contributors.tasks (collects contributor data from the GitHub API)
- augur.tasks.github.pull_requests.tasks (collects pull request data from the GitHub API)
- augur.tasks.github.repo_info.tasks (collects repository statistics from the GitHub API)
- augur.tasks.github.releases.tasks (collects release data from the GitHub API)
- augur.tasks.data_analysis.insight_worker.tasks (queries Augur’s metrics API to find interesting anomalies in the collected data)
All worker configuration options are found in the config table, which is generated when augur is installed. The config table is located in the augur_operations schema of your PostgreSQL database. Each configurable data collection job set has its own subsection with the same or a similar title as the task’s name. We recommend leaving the defaults in place and only changing them when explicitly necessary, as the default parameters will work for most use cases. Read on for more on how to make sure your workers are properly configured.
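To see what your instance is currently configured to do, you can query the table directly. A minimal sketch; the section_name, setting_name, and value column names match recent Augur releases, but verify them on your install with \d augur_operations.config if anything looks off:
# List every configurable setting, grouped by section
psql "$AUGUR_DB" -c "SELECT section_name, setting_name, value FROM augur_operations.config ORDER BY section_name, setting_name;"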
Worker-specific configuration options
Next up are the configuration options specific to some collection tasks (though some tasks require no additional configuration beyond the defaults). The most pertinent of these options is the Facade section’s repo_directory, so make sure to pay attention to that one.
Facade
- repo_directory, the local directory where the facade tasks will clone the repositories they need to analyze. You should have been prompted for this during installation, but if you need to change it (see the example after this list), make sure that it’s an absolute path (environment variables like $HOME are not supported) and that the directory already exists. Defaults to repos/, but it’s highly recommended you change this.
- limited_run, toggle between 0 and 1 to determine whether to run all facade tasks or not. Runs all tasks if set to 0.
- pull_repos, toggle whether to pull updates from repos after cloning them. If turned off, updates to repos will not be collected.
- run_analysis, toggle whether to process commit data at all. If turned off, facade will only clone repos and run tertiary tasks such as resolving contributors from any existing commits or collecting dependency relationships. Mainly used for testing.
- run_facade_contributors, toggle whether to run contributor resolution tasks. These process and parse commit data to link emails to contributors, handle aliases, and so on.
- force_invalidate_caches, set every repo to reset the status of commit email affiliation (the organization that an email is associated with).
- rebuild_caches, toggle whether to enable parsing through commit data to determine affiliation and rebuild web caches.
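Changing a setting is an UPDATE against that same config table. A sketch for repo_directory, where /home/augur/augur_repos is purely an illustrative path; remember the directory has to exist before collection starts:
# Create the clone directory, then point the facade tasks at it
mkdir -p /home/augur/augur_repos
psql "$AUGUR_DB" -c "UPDATE augur_operations.config SET value = '/home/augur/augur_repos' WHERE section_name = 'Facade' AND setting_name = 'repo_directory';"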
Insight_Task
We recommend leaving the defaults in place for the insight worker unless you are interested in other metrics, or anomalies for a different time period.
- training_days, which specifies the date range that the insight_worker should use as its baseline for the statistical comparison. Defaults to 365, meaning that the worker will identify metrics that have had anomalies compared to their values over the course of the past year, starting at the current date.
- anomaly_days, which specifies the date range in which the insight_worker should look for anomalies. Defaults to 2, meaning that the worker will detect anomalies that have occurred within the past two days, starting at the current date. (An example of widening this window appears after the list.)
- contamination, the “sensitivity” parameter for detecting anomalies. Acts as an estimated percentage of the training_days that are expected to be anomalous. The default is 0.041 for the default training_days of 365: 4.1% of 365 days means that about 15 of the 365 data points are expected to be anomalous.
- switch, toggles whether to run insight tasks at all.
- workers, number of worker processes to use for insight tasks.
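These settings live in the same config table as everything else. A hypothetical tweak that widens the anomaly window from the default two days to a week:
# Look for anomalies over the past 7 days instead (illustrative value)
psql "$AUGUR_DB" -c "UPDATE augur_operations.config SET value = '7' WHERE section_name = 'Insight_Task' AND setting_name = 'anomaly_days';"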
Task_Routine
This section is for toggling sets of jobs on or off.
- prelim_phase, toggles whether to run preliminary tasks that check whether repos are valid or not.
- primary_repo_collect_phase, toggles the standard collection jobs, mainly pull requests and issues.
- secondary_repo_collect_phase, toggles the secondary collection jobs, mainly jobs that take a while.
- facade_phase, toggles all facade jobs.
- machine_learning_phase, toggles all ML-related jobs. (See the sketch after this list for turning phases off in bulk.)
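On a resource-constrained machine you might, for example, switch off the heavier phases in one statement. Purely illustrative, using the 0/1 toggle convention described above:
# Disable the long-running and ML phases (set back to '1' to re-enable)
psql "$AUGUR_DB" -c "UPDATE augur_operations.config SET value = '0' WHERE section_name = 'Task_Routine' AND setting_name IN ('secondary_repo_collect_phase', 'machine_learning_phase');"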
Celery Configuration
We strongly recommend leaving the default Celery block generated by the installation process as-is, but if you would like to know more, or fine-tune it to your needs, read on.
The celery monitor is responsible for generating the tasks that will tell the other worker processes what data to collect, and how. The Celery block has two keys: one for the memory cap and one for the materialized views refresh interval.
- worker_process_vmem_cap, float between zero and one that determines the maximum percentage of total memory to use for worker processes. (See the arithmetic sketch after this list.)
- refresh_materialized_views_interval_in_days, number of days to wait between refreshes of materialized views.
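To get a feel for what a given cap means in absolute terms, you can do the arithmetic against your machine’s total memory. A Linux-only sketch (it reads /proc/meminfo), using 0.25 as a stand-in value for worker_process_vmem_cap:
# Memory budget for worker processes at a hypothetical cap of 0.25
awk '/MemTotal/ {printf "%.0f MB\n", $2 / 1024 * 0.25}' /proc/meminfo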
Adding repos for collection
If you’re using the Docker container, you can use the provided UI to load your repositories. Otherwise, you’ll need to use the Augur CLI or the Augur frontend to load your repositories. Please reference the respective sections of the documentation for detailed instructions.
Running collections
Congratulations! At this point you (hopefully) have a fully functioning and configured Augur instance.
After you’ve loaded your repos, you’re ready for your first collection run. We recommend running only the default jobs first to gather the initial data.
You can now run Augur and start the data collection by issuing the augur backend start command in the root augur directory. All your logs (including worker logs and error files) will be saved to a logs/ subdirectory in that same folder, but this can be customized - more on that and other logging utilities in the development guide.
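Once it’s running, tailing the logs is the quickest way to confirm that collection is progressing. Exact filenames vary by version and configured log level, so the glob here is an assumption:
# From the root augur directory, watch collection activity live
tail -f logs/*.log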
Once you’ve finished the initial data collection, we suggest then running the value_worker (if you have it installed) and the insight_worker. This is because the value_worker depends on the source files of the repositories cloned by the facade_worker, and the insight_worker uses the data from all the other workers to identify anomalies by performing statistical analysis on the data returned from Augur’s metrics API.
You’re now ready to start exploring the data Augur can gather and the metrics it can generate. If you’re interested in contributing to Augur’s codebase, you can check out the development guide. For information about Augur’s frontend, keep reading!
Happy collecting!