Pull Request Analysis Worker

This worker analyzes the open pull requests of every repository and predicts the probability of it getting accepted and merged. Pull requests having a low probability of getting merged, are indicative of being outlier/anomalous ones.

Worker Configuration

To kickstart the worker, it needs to receive a task from the Housekeeper, similar to other workers, and have a corresponding worker specific configuration.

The standard options are:

  • switch - a boolean flag indicating if the worker should automatically be started with Augur. Defaults to 0 (false).

  • workers - the number of instances of this worker that Augur should spawn if switch is set to 1. Defaults to 1.

  • port - the TCP port the worker will use to communicate with Augur’s broker, the default being 51400.

  • insight_days - open PRs created in the duration of the past x days are analyzed.

Note

  • insight_days can be adjusted to analyze very recently opened PRs as well. Run the data collection workers to have enough data to analyze!

  • This worker uses some methods of the Message Insights Worker

Worker Pipeline

When a repo is analyzed, the trained ML models are used to predict the probability of acceptance. The major factors influencing this are:

  1. Pull request characteristics: No. of commits, sentiment of PR title

  2. Contributor characteristics: Acceptance rate of PRs created, no. of projects contributed to in the past

  3. Repo characteristics: No. of open issues, no. of watchers, past acceptance rate of PRs

  4. Discussion characteristics: No. of comments, no. of participants, average sentiment of all comments

After feature engineering, the following features were considered:

  • No. of commits

  • No. of comments

  • Length of PR in days

  • Relationship between PR creator and repo (member, collaborator, etc)

  • Average sentiment score of comments

  • Watch count of repo

  • Past acceptance ratio of a PR in the repo

pull_request_analysis_worker/
├── __init__.py
├── pull_request_analysis_worker.py
├── runtime.py
├── setup.py
└── trained_pr_model.pkl

The trained_pr_model.pkl is the saved pre-trained model, used for the analysis.

After prediction, the pull_request_analysis table is populated with the predicted probabilities for every open PR.