Pull Request Analysis Worker
This worker analyzes the open pull requests of every repository and predicts the probability of each one being accepted and merged. Pull requests with a low predicted probability of being merged are treated as likely outliers or anomalies.
Worker Configuration
Like other workers, this worker starts when it receives a task from the Housekeeper, and it needs a corresponding worker-specific configuration.
The standard options are:
switch - a boolean flag indicating whether the worker should be started automatically with Augur. Defaults to 0 (false).
workers - the number of instances of this worker that Augur should spawn if switch is set to 1. Defaults to 1.
port - the TCP port the worker will use to communicate with Augur's broker. Defaults to 51400.
insight_days - open PRs created within the past x days are analyzed, where x is the value of this option.
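As a rough illustration of how the options above fit together, the following Python sketch merges user overrides onto the documented defaults and turns insight_days into a cutoff timestamp. The names DEFAULT_OPTIONS, resolve_options, and analysis_cutoff are hypothetical and not part of Augur, and the insight_days value of 30 is only a placeholder, since no default is stated here.

```python
from datetime import datetime, timedelta, timezone

# Defaults for switch, workers, and port are the ones documented above.
# The insight_days value is a placeholder; this page states no default.
DEFAULT_OPTIONS = {
    "switch": 0,
    "workers": 1,
    "port": 51400,
    "insight_days": 30,
}


def resolve_options(overrides=None):
    """Merge user overrides onto the defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    return {**DEFAULT_OPTIONS, **overrides}


def analysis_cutoff(insight_days, now=None):
    """Return the earliest creation time a PR may have and still be analyzed."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=insight_days)
```

For example, resolve_options({"switch": 1, "insight_days": 7}) keeps the default port and worker count while narrowing the analysis window to the past week.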
Note
insight_days can be adjusted to analyze very recently opened PRs as well. Run the data collection workers first so that there is enough data to analyze!
This worker uses some methods of the Message Insights Worker.
Worker Pipeline
When a repo is analyzed, the trained ML models are used to predict the probability of acceptance. The major factors influencing this are:
Pull request characteristics: No. of commits, sentiment of PR title
Contributor characteristics: Acceptance rate of PRs created, no. of projects contributed to in the past
Repo characteristics: No. of open issues, no. of watchers, past acceptance rate of PRs
Discussion characteristics: No. of comments, no. of participants, average sentiment of all comments
After feature engineering, the following features were considered:
No. of commits
No. of comments
Length of PR in days
Relationship between PR creator and repo (member, collaborator, etc.)
Average sentiment score of comments
Watch count of repo
Past acceptance ratio of a PR in the repo
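The feature list above could be assembled into a single row per pull request along the following lines. This is a minimal sketch, not the worker's actual code: the dictionary keys, the PRFeatures class, and the integer encoding of the creator relationship are all assumptions made for illustration.

```python
from dataclasses import dataclass, astuple
from datetime import datetime

# Hypothetical integer encoding of the creator's relationship to the repo.
ASSOCIATION_CODES = {"NONE": 0, "CONTRIBUTOR": 1, "COLLABORATOR": 2, "MEMBER": 3}


@dataclass
class PRFeatures:
    commits: int
    comments: int
    length_days: float
    association: int
    avg_comment_sentiment: float
    watch_count: int
    past_acceptance_ratio: float


def build_features(pr, repo, now):
    """Build one feature row from plain dicts (all key names are assumed)."""
    sentiments = pr["comment_sentiments"]
    closed = repo["closed_prs"]
    return PRFeatures(
        commits=pr["commits"],
        comments=len(sentiments),
        # Age of the PR in days, measured from its creation time.
        length_days=(now - pr["created_at"]).total_seconds() / 86400,
        association=ASSOCIATION_CODES.get(pr["association"], 0),
        # Fall back to a neutral 0.0 when there are no comments yet.
        avg_comment_sentiment=sum(sentiments) / len(sentiments) if sentiments else 0.0,
        watch_count=repo["watchers"],
        # Share of previously closed PRs in this repo that were merged.
        past_acceptance_ratio=repo["merged_prs"] / closed if closed else 0.0,
    )
```

Calling astuple() on the result yields the values in a fixed order, which is the shape a trained model would typically expect.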
pull_request_analysis_worker/
├── __init__.py
├── pull_request_analysis_worker.py
├── runtime.py
├── setup.py
└── trained_pr_model.pkl
The trained_pr_model.pkl file is the saved pre-trained model used for the analysis.
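Loading a pickled model and scoring one pull request might look like the sketch below. It assumes the saved object exposes a scikit-learn style predict_proba method and that class 1 means "merged"; neither detail is confirmed here, and the function names are hypothetical.

```python
import pickle


def load_model(path):
    """Unpickle a saved model. Only load pickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def merge_probability(model, feature_row):
    """Return the predicted probability of the positive class for one row."""
    # Assumption: predict_proba returns one [P(class 0), P(class 1)] pair
    # per input row, with class 1 standing for "merged".
    return float(model.predict_proba([feature_row])[0][1])
```

With a model loaded from trained_pr_model.pkl, merge_probability(model, row) would then give a value between 0 and 1 for each open PR.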
After prediction, the pull_request_analysis table is populated with the predicted probabilities for every open PR.
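To make the final step concrete, here is a small sketch that stores predicted probabilities and then lists the PRs least likely to be merged. It uses an in-memory SQLite database purely as a stand-in for Augur's database, and the two-column schema (pull_request_id, merge_probability) is an assumption rather than the real table definition.

```python
import sqlite3


def store_predictions(conn, predictions):
    """Write {pull_request_id: probability} pairs into the analysis table."""
    # Simplified, assumed schema; the real table may have more columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pull_request_analysis ("
        "pull_request_id INTEGER PRIMARY KEY, merge_probability REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pull_request_analysis VALUES (?, ?)",
        list(predictions.items()),
    )
    conn.commit()


def least_likely(conn, limit=5):
    """Return the PRs with the lowest predicted merge probability."""
    cursor = conn.execute(
        "SELECT pull_request_id, merge_probability "
        "FROM pull_request_analysis ORDER BY merge_probability ASC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

Sorting by ascending probability surfaces the candidates for outlier review first, which matches the worker's stated purpose.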