Insight_worker
It uses Augur’s metrics API to discover insights for every time-series metric (issues, reviews, code-changes, code-changes-lines, and so on) of every repository present in the database.
We use a BiLSTM (Bi-directional Long Short-Term Memory) model because it can capture the trend as well as long- and short-term seasonality in the data. A bidirectional LSTM (BiLSTM) layer learns bidirectional long-term dependencies between the time steps of time series or sequence data. These dependencies are useful when you want the network to learn from the complete time series at each time step.
Worker Configuration
The worker has three main configuration options that are standard across all workers. It also has a few more options that apply mainly to the machine learning model inside the worker.
The standard options are:
- switch, a boolean flag indicating if the worker should automatically be started with Augur. Defaults to 0 (false).
- workers, the number of instances of this worker that Augur should spawn if switch is set to 1. Defaults to 1.
- port, which is the base TCP port the worker will use to communicate with Augur’s broker. The default is different for each worker; for the insight_worker it is 21311.
Keeping workers at 1 should be fine for small collection sets, but if you have a lot of repositories to collect data for, you can raise it. We also suggest double checking that the default worker ports are free on your machine.
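One way to double check that a port is free is to try binding to it. The following is a minimal sketch using only the Python standard library. The helper name is_port_free is hypothetical and not part of Augur, and the loop assumes that additional worker instances take consecutive ports starting at the base port, which the text above does not confirm.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to (host, port).

    Hypothetical helper, not part of Augur.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, so another process already holds the port.
            return False
    return True


base_port = 21311  # default base port of the insight_worker
workers = 1        # default number of worker instances

# Assumption: instance i uses base_port + i.
for port in range(base_port, base_port + workers):
    print(port, "free" if is_port_free(port) else "in use")
```

If a port is reported as in use, either stop the process holding it or choose a different port value for the worker.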
The configuration options for the ML model are listed below.
We recommend leaving the defaults in place for the insight worker unless you are interested in other metrics, or in anomalies for a different time period.
- training_days, which specifies the date range that the insight_worker should use as its baseline for the statistical comparison. Defaults to 365, meaning that the worker will identify metrics that have had anomalies compared to their values over the course of the past year, starting at the current date.
- anomaly_days, which specifies the date range in which the insight_worker should look for anomalies. Defaults to 14, meaning that the worker will only detect anomalies that have occurred within the past fourteen days, starting at the current date.
- contamination, which is the “sensitivity” parameter for detecting anomalies. It acts as an estimate of the fraction of the training_days that are expected to be anomalous. The default is 0.1 for the default training days of 365: 10% of 365 days means that about 36 data points of the 365 days are expected to be anomalous.
- metrics, which specifies which metrics the insight_worker should run the anomaly detection algorithm on. This is structured like so:

    [
        'endpoint_name_1',
        'endpoint_name_2',
        ...
    ]

    # defaults to the following

    [
        "issues-new",
        "code-changes",
        "code-changes-lines",
        "reviews",
        "contributors-new"
    ]
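To make the arithmetic behind these options concrete, here is a small sketch of how the default values translate into date windows and an expected anomaly count. The function insight_windows and its return keys are hypothetical names chosen for illustration. The worker's own date handling may differ.

```python
from datetime import date, timedelta


def insight_windows(training_days=365, anomaly_days=14,
                    contamination=0.1, today=None):
    """Derive the date windows implied by the ML options (illustrative only)."""
    today = today or date.today()
    return {
        # Baseline window: the past `training_days`, ending today.
        "training_start": today - timedelta(days=training_days),
        # Anomalies are only reported inside the last `anomaly_days`.
        "anomaly_start": today - timedelta(days=anomaly_days),
        "end": today,
        # Share of the baseline expected to be anomalous, rounded down.
        "expected_anomalies": int(contamination * training_days),
    }


windows = insight_windows(today=date(2020, 3, 20))
print(windows["training_start"])      # 2019-03-21
print(windows["anomaly_start"])       # 2020-03-06
print(windows["expected_anomalies"])  # 36
```

Raising contamination makes the detector flag more points, and lowering it makes the detector stricter.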
Methods inside the Insight_model
time_series_metrics: Takes the parameters entry_info and repo_id. It collects data for the different metrics using the API endpoints, preprocesses the data, and creates a dataframe with the date and every field of the given endpoints as columns. It then calls the lstm_selection method. The structure of the dataframe is as follows:
    df >>
        index    date          endpoint1_field    endpoint2_field
        0        2020-03-20    5                  8
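A dataframe of this shape could be assembled roughly as follows. This is a sketch using pandas, not the worker's actual code. The function build_metrics_frame, the shape of the responses mapping, and the field names issues and pull_requests are all assumptions made for illustration. Filling missing days with 0 is also an assumption.

```python
import pandas as pd


def build_metrics_frame(responses, date_key="date"):
    """Combine per-endpoint records into one daily dataframe.

    `responses` is a hypothetical mapping of endpoint name to a
    non-empty list of records, each holding a date and numeric fields.
    """
    frames = []
    for endpoint, records in responses.items():
        frame = pd.DataFrame(records)
        # Collapse timestamps to whole days and sum values per day.
        frame[date_key] = pd.to_datetime(frame[date_key]).dt.normalize()
        frame = frame.groupby(date_key).sum(numeric_only=True)
        # Prefix every field with its endpoint name.
        frame.columns = [f"{endpoint}_{col}" for col in frame.columns]
        frames.append(frame)

    df = pd.concat(frames, axis=1).sort_index()
    # Assumption: days without data count as zero activity.
    full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
    df = df.reindex(full_range).fillna(0)
    return df.rename_axis("date").reset_index()


responses = {
    "issues-new": [{"date": "2020-03-20", "issues": 5},
                   {"date": "2020-03-22", "issues": 2}],
    "reviews": [{"date": "2020-03-20", "pull_requests": 8}],
}
print(build_metrics_frame(responses))
```

Reindexing onto a continuous daily range keeps the time steps evenly spaced, which matters for a sequence model.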
lstm_selection: This method takes entry_info, repo_id, and df as parameters. It selects the window_size (time_steps) by checking the sparsity and the coefficient of variation of the data, and passes it to the lstm_keras method.
preprocess_data: This method is called by the lstm_keras method with data, tr_days, lback_days, n_features, and n_predays as parameters. It arranges the training data for the BiLSTM model according to the parameters passed by lstm_keras. It returns two variables, features_set and labels, with the following structure:
    features_set = [
        [[1], [2]],
        [[2], [3]],
        [[3], [4]]
    ]

    labels = [ [3], [4], [5] ]

    # tr_days    : number of training days (not equal to the training_days passed in the configuration)
    # lback_days : number of days to look back for next-day prediction
    # n_features : number of feature columns in the dataframe used for training
    # n_predays  : number of following days to predict for each entry in features_set

    # tr_days = training_days - anomaly_days (in configuration)
    # here tr_days = 4, lback_days = 2, n_features = 1, n_predays = 1
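The structure above is a standard sliding-window arrangement. The sketch below reproduces it with NumPy for the series 1 to 5. The function make_windows is a simplified, hypothetical stand-in for preprocess_data: it does not take tr_days or n_features, and it assumes the value to predict is in the first column.

```python
import numpy as np


def make_windows(data, lback_days, n_predays):
    """Slice a series into look-back windows and the values that follow them."""
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        # Treat a flat series as a single feature column.
        data = data.reshape(-1, 1)

    features, labels = [], []
    last_start = len(data) - lback_days - n_predays
    for start in range(last_start + 1):
        end = start + lback_days
        features.append(data[start:end])             # shape (lback_days, n_features)
        labels.append(data[end:end + n_predays, 0])  # next n_predays of column 0
    return np.array(features), np.array(labels)


features_set, labels = make_windows([1, 2, 3, 4, 5], lback_days=2, n_predays=1)
print(features_set.shape)  # (3, 2, 1)
print(labels.tolist())     # [[3.0], [4.0], [5.0]]
```

The resulting features_set has the shape (samples, lback_days, n_features), which is the input layout that Keras LSTM layers expect.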
lstm_model: This method configures the multiple BiLSTM layers along with a single dense layer and the optimiser. It is called inside the lstm_keras method with features_set, n_predays, and n_features as parameters. The configuration of the model is as follows:
    model = Sequential()
    # Three stacked bidirectional LSTM layers with dropout in between
    model.add(Bidirectional(LSTM(90, activation='linear', return_sequences=True,
                                 input_shape=(features_set.shape[1], n_features))))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(90, activation='linear', return_sequences=True)))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(90, activation='linear')))
    # Single dense output unit with a linear activation
    model.add(Dense(1))
    model.add(Activation('linear'))
    model.compile(optimizer='adam', loss='mae')

This configuration is intended to give good results across all kinds of metrics.
lstm_keras: This is the most important method in the insights_model. It is called by the lstm_selection method with entry_info, repo_id, and dataframe as parameters. Here the dataframe consists of two columns: date and endpoint1_field. In this method the model is trained on the tr_days data and values are predicted for the anomaly_days data. Outliers are identified based on the difference between the actual and predicted values.
If any outliers are discovered within the anomaly_days, those points are inserted into the repo_insights and repo_insights_records tables by calling the insert_data method.
Before calling the insert_data method, the performance of the model on the training as well as the test data is evaluated. Its summary is inserted into the lstm_anomaly_results table, and the unique model configuration is inserted into the lstm_anomaly_models table.
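The text does not state the exact rule used to turn prediction errors into outliers. As one possible illustration, the sketch below flags points in the anomaly window whose absolute error exceeds the mean training error by more than a chosen number of standard deviations. The function flag_outliers, the n_sigma parameter, and the threshold rule itself are assumptions, not the worker's documented behaviour.

```python
import numpy as np


def flag_outliers(train_actual, train_pred, test_actual, test_pred, n_sigma=3.0):
    """Flag anomaly-window points with unusually large prediction error.

    Assumed rule: error > mean(training error) + n_sigma * std(training error).
    """
    train_error = np.abs(np.asarray(train_actual, dtype=float)
                         - np.asarray(train_pred, dtype=float))
    threshold = train_error.mean() + n_sigma * train_error.std()

    test_error = np.abs(np.asarray(test_actual, dtype=float)
                        - np.asarray(test_pred, dtype=float))
    return test_error > threshold


flags = flag_outliers(
    train_actual=[10, 12, 11, 13, 12], train_pred=[11, 11, 12, 12, 12],
    test_actual=[12, 30, 11], test_pred=[12, 12, 12],
)
print(flags.tolist())  # [False, True, False]
```

The rows selected by such a mask would play the role of anomaly_df, the dataframe of outlier points that is handed to insert_data.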
insert_data: It is called by the lstm_keras method with entry_info, repo_id, anomaly_df, and model as parameters. Here anomaly_df is the dataframe consisting of the points classified as outliers within the anomaly_days.
Insights_model consists of multiple independent methods such as time_series_metrics and insert_data. These methods can be used independently with other machine learning models. The preprocess_data and lstm_model methods can also be easily modified to suit different LSTM network configurations.