Pull Request Analysis Worker

This worker analyzes the open pull requests of every repository and predicts the probability of it getting accepted and merged. Pull requests having a low probability of getting merged, are indicative of being outlier/anomalous ones.

Worker Configuration

To kickstart the worker, it needs to receive a task from the Housekeeper, similar to other workers, and have a corresponding worker specific configuration.

The standard options are:

switch - a boolean flag indicating if the worker should automatically be started with Augur. Defaults to 0 (false).
workers - the number of instances of this worker that Augur should spawn if switch is set to 1. Defaults to 1.
port - the TCP port the worker will use to communicate with Augur’s broker, the default being 51400.
insight_days - open PRs created in the duration of the past x days are analyzed.

Note

insight_days can be adjusted to analyze very recently opened PRs as well. Run the data collection workers to have enough data to analyze!
This worker uses some methods of the Message Insights Worker

Worker Pipeline

When a repo is analyzed, the trained ML models are used to predict the probability of acceptance. The major factors influencing this are:

Pull request characteristics: No. of commits, sentiment of PR title
Contributor characteristics: Acceptance rate of PRs created, no. of projects contributed to in the past
Repo characteristics: No. of open issues, no. of watchers, past acceptance rate of PRs
Discussion characteristics: No. of comments, no. of participants, average sentiment of all comments

After feature engineering, the following features were considered:

No. of commits
No. of comments
Length of PR in days
Relationship between PR creator and repo (member, collaborator, etc)
Average sentiment score of comments
Watch count of repo
Past acceptance ratio of a PR in the repo

pull_request_analysis_worker/
├── __init__.py
├── pull_request_analysis_worker.py
├── runtime.py
├── setup.py
└── trained_pr_model.pkl

The trained_pr_model.pkl is the saved pre-trained model, used for the analysis.

After prediction, the pull_request_analysis table is populated with the predicted probabilities for every open PR.