Insight_worker

It uses Augur’s metrics API to discover insights for every time-series metric (issues, reviews, code-changes, code-changes-lines, etc.) of every repository present in the database.

We use a BiLSTM (Bi-directional Long Short-Term Memory) model because it can capture trend as well as long- and short-term seasonality in the data. A bidirectional LSTM (BiLSTM) layer learns bidirectional long-term dependencies between the time steps of time-series or sequence data. These dependencies are useful when the network should learn from the complete time series at each time step.

Worker Configuration

The worker has three main configuration options that are standard across all workers. It also has a few more options that apply mainly to the machine learning model inside the worker.

The standard options are:

switch, a boolean flag indicating if the worker should automatically be started with Augur. Defaults to 0 (false).
workers, the number of instances of this worker that Augur should spawn if switch is set to 1. Defaults to 1.
port, which is the base TCP port the worker will use to communicate with Augur’s broker. The default is different for each worker; for the insight_worker it is 21311.

Keeping workers at 1 should be fine for small collection sets, but if you have a lot of repositories to collect data for, you can raise it. We also suggest double checking that the default worker ports are free on your machine.

The configuration options for the ML model are:

We recommend leaving the defaults in place for the insight worker unless you are interested in other metrics, or in anomalies for a different time period.

training_days, which specifies the date range that the insight_worker should use as its baseline for the statistical comparison. Defaults to 365, meaning that the worker will identify metrics that have had anomalies compared to their values over the course of the past year, starting at the current date.
anomaly_days, which specifies the date range in which the insight_worker should look for anomalies. Defaults to 14, meaning that the worker will only detect anomalies that have occurred within the past fourteen days, starting at the current date.
contamination, which is the “sensitivity” parameter for detecting anomalies. Acts as an estimated percentage of the training_days that are expected to be anomalous. The default is 0.1 for the default training days of 365: 10% of 365 days means that about 36 data points of the 365 days are expected to be anomalous.

metrics, which specifies which metrics the insight_worker should run the anomaly detection algorithm on. This is structured like so:

[
    'endpoint_name_1',
    'endpoint_name_2',
    ...
]

# defaults to the following

[
    "issues-new",
    "code-changes",
    "code-changes-lines",
    "reviews",
    "contributors-new"
]
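
As a rough illustration, the options above can be collected into one Python dictionary together with two quantities derived from them. The dictionary layout and the name insight_worker_defaults are assumptions made for this sketch and do not reproduce Augur's actual configuration file format; only the option names and default values come from the text above.

```python
# Hypothetical container for the documented defaults; not Augur's real config schema.
insight_worker_defaults = {
    "switch": 0,
    "workers": 1,
    "port": 21311,
    "training_days": 365,
    "anomaly_days": 14,
    "contamination": 0.1,
    "metrics": [
        "issues-new",
        "code-changes",
        "code-changes-lines",
        "reviews",
        "contributors-new",
    ],
}

cfg = insight_worker_defaults

# Days left for training once the anomaly window is held out
# (tr_days = training_days - anomaly_days).
tr_days = cfg["training_days"] - cfg["anomaly_days"]

# Number of points in the training period expected to be anomalous.
expected_anomalies = cfg["contamination"] * cfg["training_days"]

print(tr_days)             # 351
print(expected_anomalies)  # 36.5
```

Raising contamination makes the worker expect more anomalous points in the same period, which is why it behaves as a sensitivity setting.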

Methods inside the Insight_model

time_series_metrics: Takes entry_info and repo_id as parameters. It collects data for the different metrics from the API endpoints, preprocesses the data, and creates a dataframe with the date and every field of the given endpoints as columns. It then calls the lstm_selection method. The structure of the dataframe is as follows:

df>>
index   date          endpoints1 _ field     endpoints2 _ field
0.      2020-03-20    5                      8
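
The merge step behind this layout can be pictured with a small dependency-free sketch. The endpoint responses and field names below are made-up sample data, and merge_endpoints is a hypothetical helper, not the worker's own function. The worker builds a dataframe, whereas this sketch uses plain dictionaries to show the same column layout; filling missing dates with 0 is also an assumption.

```python
from collections import defaultdict


def merge_endpoints(responses):
    """Merge per-endpoint records into one row per date.

    `responses` maps an endpoint name to a list of records, each holding a
    "date" key plus one or more numeric fields. Columns are named
    "<endpoint>_<field>".
    """
    rows = defaultdict(dict)
    columns = []
    for endpoint, records in responses.items():
        for record in records:
            for field, value in record.items():
                if field == "date":
                    continue
                column = f"{endpoint}_{field}"
                if column not in columns:
                    columns.append(column)
                rows[record["date"]][column] = value
    # Sort by date and fill gaps with 0 (an assumption for this sketch).
    return [
        {"date": date, **{c: rows[date].get(c, 0) for c in columns}}
        for date in sorted(rows)
    ]


# Made-up sample responses for two endpoints (field names are hypothetical).
sample = {
    "issues-new": [{"date": "2020-03-20", "issues": 5}],
    "reviews": [
        {"date": "2020-03-20", "pull_requests": 8},
        {"date": "2020-03-21", "pull_requests": 2},
    ],
}

table = merge_endpoints(sample)
for row in table:
    print(row)
```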

lstm_selection: Takes entry_info, repo_id, and df as parameters. It selects window_size (time_steps) by checking the sparsity and coefficient of variation of the data, which is then passed into the lstm_keras method.
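
The two statistics named for lstm_selection can be computed as in the sketch below. The selection rule, the cutoffs, and the window sizes 7 and 14 are hypothetical values chosen only for illustration; the thresholds the worker actually uses are not stated here.

```python
from statistics import mean, pstdev


def sparsity(values):
    """Fraction of entries that are zero."""
    return sum(1 for v in values if v == 0) / len(values)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def choose_window(values, sparse_cutoff=0.5, cv_cutoff=1.0):
    """Hypothetical rule: longer look-back for sparse or highly variable series."""
    if sparsity(values) > sparse_cutoff or coefficient_of_variation(values) > cv_cutoff:
        return 14
    return 7


steady = [5, 6, 5, 7, 6, 5, 6, 7]
sparse = [0, 0, 0, 9, 0, 0, 0, 1]
print(choose_window(steady), choose_window(sparse))  # 7 14
```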
preprocess_data: Called by the lstm_keras method with data, tr_days, lback_days, n_features, and n_predays as parameters. It arranges the training data for the BiLSTM model according to the parameters passed by lstm_keras, and returns two variables, features_set and labels, with the following structure:

features_set = [
    [[1],
     [2]],
    [[2],
     [3]],
    [[3],
     [4]]
]
labels = [ [3],[4],[5] ]

# tr_days    : number of training days (not equal to the training_days set in the configuration)
# lback_days : number of days to look back for the next-day prediction
# n_features : number of feature columns in the dataframe used for training
# n_predays  : number of following days to predict for each entry in features_set


#tr_days = training_days - anomaly_days   (in configuration)

#here
tr_days = 4,
lback_days = 2,
n_features = 1,
n_predays = 1
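
The structure above can be reproduced with a short sliding-window sketch. It assumes the input series is [1, 2, 3, 4, 5] and a single feature column; the function name sliding_windows and the loop bounds are assumptions chosen to match the documented example, not a copy of the worker's preprocess_data. Plain lists stand in for the arrays the worker would pass to the model.

```python
def sliding_windows(series, tr_days, lback_days, n_predays):
    """Build look-back windows and next-value labels from a 1-feature series.

    Loop bounds are an assumption chosen to reproduce the documented example.
    """
    features_set, labels = [], []
    for i in range(lback_days, tr_days + 1):
        # Each window holds lback_days values, each wrapped as a 1-feature row.
        features_set.append([[v] for v in series[i - lback_days:i]])
        # The label holds the n_predays values that follow the window.
        labels.append(series[i:i + n_predays])
    return features_set, labels


features_set, labels = sliding_windows(
    [1, 2, 3, 4, 5], tr_days=4, lback_days=2, n_predays=1
)
print(features_set)  # [[[1], [2]], [[2], [3]], [[3], [4]]]
print(labels)        # [[3], [4], [5]]
```

Each entry of features_set has shape (lback_days, n_features), which matches the input_shape the first BiLSTM layer expects.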

lstm_model: Configures multiple BiLSTM layers along with a single dense layer and the optimiser. This method is called inside the lstm_keras method with features_set, n_predays, and n_features as parameters. The configuration of the model is as follows:

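# Imports assumed here for illustration (tf.keras); the worker may import these from keras directly.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Activation, Bidirectional, Dense, Dropout

model = Sequential()
# Three stacked BiLSTM layers; the first two return full sequences for the next layer.
model.add(Bidirectional(LSTM(90, activation='linear', return_sequences=True, input_shape=(features_set.shape[1], n_features))))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(90, activation='linear', return_sequences=True)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(90, activation='linear')))
# Single dense output with a linear activation; trained with Adam on mean absolute error.
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(optimizer='adam', loss='mae')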

This configuration is designed to achieve the best possible results for all kinds of metrics.

lstm_keras: This is the most important method in the insights_model. It is called by the lstm_selection method with entry_info, repo_id, and the dataframe as parameters. Here the dataframe consists of two columns: date and endpoint1 _ field. The model is trained on the tr_days data and predicts values for the anomaly_days data. Outliers are identified from the difference between the actual and predicted values.

If any outliers are discovered within the anomaly_days, those points are inserted into the repo_insights and repo_insights_records tables by calling the insert_data method.

Before calling the insert_data method, the performance of the model on the training and test data is evaluated. A summary is inserted into the lstm_anomaly_results table, and the unique model configuration is inserted into the lstm_anomaly_models table.
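
The outlier step in lstm_keras can be sketched as follows. This is an illustration under stated assumptions, not the worker's actual rule: it assumes the cutoff is the absolute error that only a contamination fraction of training-period errors exceed, and flag_outliers along with all the numbers below are made up for the example.

```python
def flag_outliers(train_errors, actual, predicted, contamination=0.1):
    """Return indices in the anomaly window whose error exceeds the cutoff.

    The cutoff is chosen so that roughly a `contamination` fraction of the
    training-period errors lie above it (an assumption for this sketch).
    """
    ranked = sorted(train_errors)
    # Number of training errors allowed to sit above the cutoff.
    k = int(round(len(ranked) * contamination))
    cutoff = ranked[max(len(ranked) - 1 - k, 0)]
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return [i for i, e in enumerate(errors) if e > cutoff]


# Made-up numbers: ten training errors, then a five-day anomaly window.
train_errors = [0.5, 1.0, 0.2, 0.8, 1.5, 0.3, 0.9, 0.4, 0.7, 3.0]
actual = [10, 12, 30, 11, 9]
predicted = [11.0, 11.5, 12.0, 10.5, 9.5]
print(flag_outliers(train_errors, actual, predicted))  # [2]
```

In this example only the third day is flagged, since its error of 18 is far above the cutoff of 1.5 derived from the training errors.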

insert_data: Called by the lstm_keras method with entry_info, repo_id, anomaly_df, and model as parameters. Here anomaly_df is the dataframe of points classified as outliers within the anomaly_days.

Insights_model consists of multiple independent methods such as time_series_metrics and insert_data. These methods can be used independently with other machine learning models. The preprocess_data and model_lstm methods can also be easily modified for different LSTM network configurations.