GraphRepo documentation¶
Overview & Installation¶
GraphRepo is a tool that indexes Git repositories in Neo4j and lets you query and aggregate the data. Under the hood it uses PyDriller to parse the data from a repository.
Requirements¶
- Python 3.4 (or newer)
- Neo4j 3
- Docker (optional) - we recommend using Docker for Neo4j (as indicated below)
Installation - clone source code (dev version)¶
The latest development version can be cloned from Github:
$ git clone --recurse-submodules https://github.com/NullConvergence/GraphRepo
$ cd GraphRepo
Install the requirements:
$ pip install -r requirements.txt
Run a docker instance with Neo4j:
$ docker run -p 7474:7474 -p 7687:7687 -v $HOME/neo4j/data:/data -v $HOME/neo4j/plugins:/plugins -e NEO4JLABS_PLUGINS=\[\"apoc\"\] -e NEO4J_AUTH=neo4j/neo4jj neo4j:3.5.11
Run the tests:
$ pytest
Or see the Examples.
Configuration¶
For any activity, GraphRepo uses a YAML (.yml) configuration file with two objects:
- a Neo4j instance configuration, and
- a repository configuration,
as follows:
neo:
  db_url: localhost # the URL of the Neo4j database
  port: 7687 # the Neo4j port
  db_user: neo4j # Neo4j authentication username
  db_pwd: neo4jj # Neo4j authentication password
  batch_size: 100 # the batch size for inserting records in Neo4j - this setting depends on the Neo4j resources
project:
  repo: "repos/graphrepo/" # the repository filepath
  start_date: "1 February, 2018" # the start date for indexing (leave empty to index from the first commit)
  end_date: "30 March, 2018" # the end date for indexing (leave empty to index up to the last commit)
  project_id: "graphrepo" # a unique project id for the database
  index_code: False # boolean; if True, GraphRepo indexes, for each file touched by a commit, the source code before and after the commit. This parameter significantly increases the index time and the hardware resources needed for Neo4j. For a medium-sized project with 4000 commits and an average of one file edited per commit, the equivalent of 8000 files will be stored as text in Neo4j if this parameter is set to True.
  index_developer_email: True # boolean; if True, GraphRepo indexes developer emails in the Developer node. Turn this flag off for GDPR or any other privacy concerns.
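Before handing the file to GraphRepo, it can be loaded and sanity-checked with PyYAML; a minimal sketch (the path and the checks are illustrative, not part of GraphRepo's API):

import yaml

# load the GraphRepo config (path is illustrative)
with open("examples/configs/pydriller.yml") as f:
    config = yaml.safe_load(f)

# basic sanity checks on the two expected objects
assert "neo" in config and "project" in config
print(config["neo"]["db_url"], config["neo"]["port"])
print(config["project"]["project_id"])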
Neo4j configuration¶
GraphRepo connects to Neo4j over the Bolt protocol, via py2neo. Currently the only attributes needed to connect to Neo4j are the URL and port, plus the authentication credentials. All other configuration (e.g., setting user permissions) is done on the database side.
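For reference, a direct Bolt connection with py2neo, using the credentials from the configuration above, looks roughly as follows (GraphRepo handles this internally; this sketch is only useful for debugging the connection):

from py2neo import Graph

# connect over Bolt using the credentials from the config above
graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo4jj"))

# a trivial query to verify the connection
print(graph.run("RETURN 1 AS ok").data())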
Repository configuration¶
In order to insert a repository in the database, it has to be cloned on the local machine (where GraphRepo will run). Afterwards, it can be linked with GraphRepo using the project.repo attribute in the config file.
If one does not want to use all the repository data (e.g., if the repository is very large), one can configure the index dates using the project.start_date and project.end_date attributes.
The project.project_id attribute is used to give each project a unique identifier.
Currently, GraphRepo indexes all repositories in the same database, so that information about teams of developers working on distinct projects can be mined without merging databases.
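Since all projects share one database, queries are typically scoped by project_id. A minimal sketch with py2neo (the query is illustrative, not a GraphRepo API):

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo4jj"))

# count the nodes belonging to a single project
result = graph.run(
    "MATCH (n {project_id: 'graphrepo'}) RETURN count(n) AS nodes"
).data()
print(result)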
The project.index_code attribute decides whether GraphRepo indexes, for each file touched by a commit, the source code before and after the commit. This parameter significantly increases the index time and the hardware resources needed for Neo4j. For a medium-sized project with 4000 commits and an average of one file edited per commit, the equivalent of 8000 files will be stored as text in Neo4j if this parameter is set to True.
For examples of config files, see the project's repository, e.g., examples/configs/pydriller.yml.
Architecture¶
GraphRepo consists of 3 main components:
- Drillers - components used to parse data from a git repository and insert records in Neo4j,
- Miners and MinerManager - components which hold default queries and interfaces for retrieving data from Neo4j, and
- Mappers - components used to transform the data retrieved by Miners into a specific format, and to filter or sort the data.
The advantage of using custom mappers is that the load on Neo4j can be decreased: lighter queries extract the raw data from Neo4j, while the more intensive processing happens in the custom mappers. For example, one can write a mapper using PySpark on raw data extracted from Neo4j and use the Apache Spark engine for scalability.
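To make this concrete, the following is a minimal sketch of a custom mapper in Python (the function and the record shape are illustrative, not part of GraphRepo; the fields mirror the UpdateFile attributes described in the Schema section):

import pandas as pd

# a hypothetical mapper: raw UpdateFile records -> chronological DataFrame
def complexity_over_time(records):
    """Map raw records (a list of dicts) to a DataFrame sorted by time."""
    df = pd.DataFrame(records)
    # keep only the fields the analysis needs and order chronologically
    return df[["timestamp", "complexity"]].sort_values("timestamp")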
Specific information about each component can be found using the links above.
Schema¶
The resulting Neo4j schema consists of 5 node types and 6 relationship types, as illustrated below:
Nodes¶
Branch¶
Each branch identified by PyDriller is indexed as a node with the following attributes:
{
"hash": "string - unique identifier",
"project_id": "string - project id from config (can be used to select all branches from a project)",
"name": "string - branch name",
}
Commit¶
Each commit is indexed as a node with the following attributes:
{
"hash": "string - unique identifier in Neo4j",
"commit_hash": "string - commit hash in git",
"message": "string - commit message in git",
"is_merge": "int - 1 if the commit is merge, 0 otherwise",
"timestamp": "int - Unix epoch, time of the commit",
"project_id": "string - project id from config (can be used to select all branches from a project)",
"dmm_unit_complexity": "int, see Pydriller",
"dmm_unit_interfacing": "int, see Pydriller",
"dmm_unit_size": "int, see Pydriller"
}
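As an illustration of how these attributes can be queried, the following sketch lists merge commits for a project with py2neo and plain Cypher (assuming the node labels and attribute names above; the connection details are the ones from Configuration):

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo4jj"))

# list merge commits for a project, oldest first
merges = graph.run(
    "MATCH (c:Commit {project_id: 'graphrepo'}) "
    "WHERE c.is_merge = 1 "
    "RETURN c.commit_hash AS hash, c.timestamp AS ts "
    "ORDER BY c.timestamp"
).data()
print(merges[:5])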
Developer¶
Each developer is indexed as a node with the following attributes:
{
"hash": "string - unique identifier",
"name": "string - developer name as in git",
"email": "string - developer email as in git",
}
Currently the name and email information is not anonymized.
File¶
Each file is indexed as a node with the following attributes:
{
"hash": "string - unique identifier",
"name": "string - file short name as in git",
"project_id": "string - project id from config (can be used to select all branches from a project)",
"type": "string - file extension, e.g., '.py'"
}
Method¶
Each method is indexed as a node with the following attributes:
{
"hash": "string - unique identifier",
"name": "string - method name as in file",
"file_name": "string - parent file name",
"project_id": "string - project id from config (can be used to select all branches from a project)",
"type": "string - file extension, e.g., '.py'"
}
Relationships¶
Author¶
An Author relationship exists between each commit and its author (a Developer node). The direction is from Developer to Commit and the relationship attributes are:
{
"timestamp": "int - Unix epoch, time of the commit"
}
BranchCommit¶
A BranchCommit relationship exists between each branch and the branch commits. The direction is from Branch to Commit. This relationship does not have any special attributes.
Method¶
A Method relationship exists between each file and its methods. The direction is from File to Method. This relationship does not have any special attributes. In order to find out whether a method is still part of the file or was deleted, one can use the FileMiner.
Parent¶
A Parent relationship exists between each commit and its parent(s). This relationship does not have any special attributes.
UpdateFile¶
An UpdateFile relationship exists between a commit that edited a file and the edited file. The direction is from Commit to File and the relationship attributes are:
{
"timestamp": "int - Unix epoch, time of the commit",
"old_path": "string - old path, if the file was moved (see type attribute)",
"path": "string - current file path",
"diff": "string - commit diff",
"source_code": "string - source code after the commit",
"source_code_before": "string - source before after the commit",
"nloc": "int - file lines of code after the commit",
"complexity": "int - file complexity after the commit",
"token_count": "int - number of tokens after the commit",
"added": "int - number of lines added in commit",
"removed": "int - number of lines removed in commit",
"type": "string - type of update. Possible values are: 'ADD', 'COPY', 'RENAME', 'DELETE', 'MODIFY', 'UNKNOWN' "
}
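Because the complexity and size metrics live on this relationship, a file's history can be queried directly. A minimal sketch, assuming the labels above and a hypothetical file named 'graph.py':

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo4jj"))

# complexity history of a (hypothetical) file named 'graph.py'
history = graph.run(
    "MATCH (:Commit)-[u:UpdateFile]->(:File {name: 'graph.py'}) "
    "RETURN u.timestamp AS ts, u.complexity AS complexity "
    "ORDER BY u.timestamp"
).data()
print(history)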
UpdateMethod¶
An UpdateMethod relationship exists between a commit that edited a method and the edited method. The direction is from Commit to Method and the relationship attributes are:
{
"timestamp": "int - Unix epoch, time of the commit",
"long_name": "string - method long name, including parameters",
"parameters": "string - method parameters",
"complexity": "int - method complexity, after commit",
"nloc": "int - method lines of code, after commit",
"fan_in": "int - method fan in, after commit",
"fan_out": "int - method fan out, after commit",
"general_fan_out": "int -method general fan out, after commit",
"length": "int -method general fan out, after commit",
"token_count": "int -method nr of tokens, after commit",
"start_line": "int -method start line, after commit",
"end_line": "int -method end line, after commit",
}
Drillers¶
All drillers parse a repository and insert it in Neo4j. Under the hood, all drillers use PyDriller to extract data from a repository.
Drillers perform the following activities. Given a config file, they:
- establish a connection to Neo4j (or raise an exception if the connection fails),
- parse the data from PyDriller,
- insert the data in Neo4j.
Currently there are 3 drillers available:
- Driller - the default driller; it stores the data parsed from the repository in RAM.
- CacheDriller - stores the data parsed from the repository on disk (thus saving RAM at the cost of more disk writes and decreased performance).
- QueueDriller - sends the data parsed from a repository to a queue. Currently it supports RabbitMQ and Artemis. Note that with a queue, two drillers must be used: (i) one that parses the data from Git repositories and (ii) one that indexes the data in Neo4j.
The queue driller is the most scalable one, since it allows multiple indexing instances and thus works around some scalability limits (e.g., PyDriller is single-threaded).
In order to index the data, you will need a config file (see Configuration) and the following code:
from graphrepo.drillers.drillers import Driller

# configure driller
driller = Driller(config_path='path-to-yaml-config-file.yml')

# initialize the database indexes
try:
    driller.init_db()
except Exception as exc:
    print("DB already initialized")

# drill (extract data and store it in Neo4j)
driller.drill_batch()

# merge duplicate nodes
driller.merge_all()
For a complete example, see Examples.
Miners¶
Miners are special classes which hold default Neo4j queries that can be used to extract data. At the moment, there are 4 standard miners, specific to the most important node entities in the graph:
- CommitMiner - default queries for commits (including relationships to other nodes),
- DeveloperMiner - default queries for developers (including relationships to other nodes),
- FileMiner - default queries for files (including relationships to other nodes),
- MethodMiner - default queries for methods (including relationships to other nodes),
and a MineManager, which initializes and configures all miners.
We recommend always using the MineManager for initialization, since it adds no overhead compared to initializing a single miner.
Using a config file (see Configuration), the MineManager can be initialized as follows:
from graphrepo.miners import MineManager
# initialize mine manager
miner = MineManager(config_path='path-to-yaml-config-file.yml')
# The specific miners can now be accessed as:
miner.commit_miner.get_all()
miner.dev_miner.get_all()
miner.file_miner.get_all()
miner.method_miner.get_all()
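The results can then be passed to a mapper or an analysis library. A minimal sketch, continuing from the code above and assuming get_all returns a list of record dicts (the exact return type may differ):

import pandas as pd

# materialize commit records into a DataFrame for further analysis
commits = miner.commit_miner.get_all()
df = pd.DataFrame(commits)  # assumes a list of dicts; adjust if not
print(df.head())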
Examples¶
In the project’s repository there are many examples of how to use GraphRepo to index and mine data.
Please note that in order to run the plotting examples you have to install pandas and plotly, for example using pip:
$ pip install pandas plotly
1. Index data¶
In this example, we index all data from the PyDriller repository in Neo4j. The example assumes you are running a Neo4j instance in Docker, as indicated in Configuration.
In order to run the example, clone the projects using the following commands:
$ git clone --recurse-submodules https://github.com/NullConvergence/GraphRepo
$ cd GraphRepo
$ mkdir repos
$ cd repos
$ git clone https://github.com/ishepard/pydriller
In this step we cloned the GraphRepo project, which includes the example scripts, and the PyDriller project, which we want to experiment with.
In order to run the indexing example, make sure to configure the config file in examples/configs/pydriller.yml and set the neo object to your database settings.
Then run:
$ python -m examples.index_all --config=examples/configs/pydriller.yml
After indexing finishes, you can go to http://<database-url>:7474/browser/ and explore the project, with a query like: MATCH (n) RETURN n.
2. Retrieve all data¶
This step assumes you already indexed the PyDriller repository in Neo4j, as indicated at Step 1. In order to retrieve all information for PyDriller, we can run the following example:
$ python -m examples.mine_all --config=examples/configs/pydriller.yml
This script will print the number of nodes indexed in the database.
3. Plot file complexity over time¶
This step assumes you already indexed the PyDriller repository
in Neo4j, as indicated at Step 1.
In this example we will use the miners to retrieve a file and plot its complexity evolution over time. The file used is examples/file_complexity.py.
The complexity is stored in the UpdateFile relationship (see Schema). The get_change_history method of the FileMiner retrieves all the UpdateFile relationships that point to the file.
For plotting, in the example we map the data to a pandas DataFrame and use Plotly, although any other libraries can be used.
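The following is a minimal sketch of what the example does; the exact signature of get_change_history and the shape of its result are assumptions, so refer to examples/file_complexity.py for the authoritative code:

import pandas as pd
import plotly.express as px
from graphrepo.miners import MineManager

miner = MineManager(config_path='examples/configs/pydriller.yml')

# pick a file and fetch its UpdateFile history (signature is assumed)
file = miner.file_miner.get_all()[0]
history = miner.file_miner.get_change_history(file)

# map the records to a DataFrame and plot complexity over time
df = pd.DataFrame(history).sort_values('timestamp')
fig = px.line(df, x='timestamp', y='complexity')
fig.show()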
In order to display the plot, run:
$ python -m examples.file_complexity --config=examples/configs/pydriller.yml
4. Plot the complexity of a file's methods over time¶
This step assumes you already indexed the PyDriller repository
in Neo4j, as indicated at Step 1.
In this example we will use the miners to retrieve and plot the complexity evolution over time of all methods in a file. The file used is examples/all_method_complexity.py.
The complexity is stored in the UpdateMethod relationship (see Schema). We first get all the methods for a file, then, for each method, we get the update information as in Step 2.
For plotting, in the example we map the data to a pandas DataFrame and use Plotly, although any other libraries can be used.
In order to display the plot, run:
$ python -m examples.all_method_complexity --config=examples/configs/pydriller.yml