API Reference

GitRepository

This module includes 1 class, GitRepository, representing a repository in Git.

class pydriller.git_repository.GitRepository(path: str, conf=None)

Class representing a repository in Git. It contains most of the logic of PyDriller: obtaining the list of commits, checkout, reset, etc.

__init__(path: str, conf=None)

Initialize the Git repository.

Parameters:path (str) – path to the repository
__module__ = 'pydriller.git_repository'
checkout(_hash: str) → None

Check out the repo at the specified commit. BE CAREFUL: this will change the state of the repo, hence it should not be used with more than 1 thread.

Parameters:_hash – commit hash to checkout
files() → List[str]

Obtain the list of the files (excluding .git directory).

Returns:List[str], the list of the files
get_commit(commit_id: str) → pydriller.domain.commit.Commit

Get the specified commit.

Parameters:commit_id (str) – hash of the commit to analyze
Returns:Commit
get_commit_from_gitpython(commit: git.objects.commit.Commit) → pydriller.domain.commit.Commit

Build a PyDriller commit object from a GitPython commit object. This is internal to PyDriller; users will generally not need it.

Parameters:commit (GitCommit) – GitPython commit
Returns:Commit commit: PyDriller commit
get_commit_from_tag(tag: str) → pydriller.domain.commit.Commit

Obtain the tagged commit.

Parameters:tag (str) – the tag
Returns:Commit commit: the commit the tag referred to
get_commits_last_modified_lines(commit: pydriller.domain.commit.Commit, modification: pydriller.domain.commit.Modification = None, hyper_blame: bool = False, hashes_to_ignore_path: str = None) → Dict[str, Set[str]]

Given the Commit object, returns the set of commits that last “touched” the lines that are modified in the files included in the commit. It applies SZZ.

IMPORTANT: for better results, we suggest installing Google depot_tools first (see https://dev.chromium.org/developers/how-tos/install-depot-tools). This allows PyDriller to use “git hyper-blame” instead of the normal blame. If depot_tools is not installed, PyDriller will automatically fall back to the normal blame.

The algorithm works as follows (for every file in the commit):

1- obtain the diff

2- obtain the list of deleted lines

3- blame the file and obtain the commits where those lines were added

A single Modification can also be passed as a parameter; in this case only that file will be analyzed.

Parameters:
  • commit (Commit) – the commit to analyze
  • modification (Modification) – single modification to analyze
  • hyper_blame (bool) – whether to use git hyper blame or the normal blame (by default it uses the normal blame).
  • hashes_to_ignore_path (str) – path to a file containing hashes of commits to ignore. (only works with git hyper blame)
Returns:

a dictionary mapping each analyzed file to the set of hashes of the bug-inducing commits
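
As a rough illustration of how this method might be called, the sketch below collects every hash returned for a bug-fixing commit. The helper name bug_inducing_hashes is hypothetical and not part of PyDriller; it relies only on get_commit and get_commits_last_modified_lines as documented here, and assumes the values of the returned dictionary are sets of commit hashes, as the signature suggests.

```python
from typing import Set


def bug_inducing_hashes(git_repo, fix_hash: str) -> Set[str]:
    """Union of all commit hashes blamed for the lines changed by fix_hash.

    `git_repo` is expected to be a GitRepository (or any object exposing
    the same two methods). Hypothetical helper, not part of PyDriller.
    """
    fix_commit = git_repo.get_commit(fix_hash)
    # Documented return type: Dict[str, Set[str]]
    blamed = git_repo.get_commits_last_modified_lines(fix_commit)
    hashes: Set[str] = set()
    for commit_hashes in blamed.values():
        hashes |= commit_hashes
    return hashes


# Usage (assumes PyDriller is installed and the path points to a real repo):
# from pydriller.git_repository import GitRepository
# print(bug_inducing_hashes(GitRepository("path/to/repo"), "<fix commit hash>"))
```

Flattening the dictionary discards the per-key grouping, which is convenient when only the overall set of candidate commits matters.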

get_commits_modified_file(filepath: str) → List[str]

Given a filepath, returns all the commits that modified this file (following renames).

Parameters:filepath (str) – path to the file
Returns:the list of commit hashes
get_head() → pydriller.domain.commit.Commit

Get the head commit.

Returns:Commit of the head commit
get_list_commits(branch: str = None, reverse_order: bool = True) → Generator[[pydriller.domain.commit.Commit, None], None]

Return a generator of all the commits in the repo.

Returns:Generator[Commit], the generator of all the commits in the repo
get_tagged_commits()

Obtain the hashes of all the tagged commits.

Returns:list of tagged commits (can be empty if there are no tags)
git

GitPython object Git.

Returns:Git
parse_diff(diff: str) → Dict[str, List[Tuple[int, str]]]

Given a diff, returns a dictionary with the added and deleted lines. The dictionary has 2 keys: “added” and “deleted”, each containing the corresponding added or deleted lines. For both keys, the value is a list of Tuple (int, str), corresponding to (number of line in the file, actual line).

Parameters:diff (str) – diff of the commit
Returns:Dictionary
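
To make the shape of the returned dictionary concrete, here is a minimal standalone sketch of the same idea. It is not PyDriller's implementation: the function name parse_unified_diff is hypothetical, and it assumes the diff contains only hunks (no “---”/“+++” file header lines).

```python
import re
from typing import Dict, List, Tuple

# Matches hunk headers such as "@@ -10,4 +12,6 @@"
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def parse_unified_diff(diff: str) -> Dict[str, List[Tuple[int, str]]]:
    """Collect (line number, text) pairs for added and deleted lines."""
    result: Dict[str, List[Tuple[int, str]]] = {"added": [], "deleted": []}
    old_no = new_no = 0  # current line number in the old and new file
    for line in diff.split("\n"):
        header = HUNK_HEADER.match(line)
        if header:
            old_no, new_no = int(header.group(1)), int(header.group(2))
        elif line.startswith("-"):
            result["deleted"].append((old_no, line[1:]))
            old_no += 1
        elif line.startswith("+"):
            result["added"].append((new_no, line[1:]))
            new_no += 1
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        else:
            # Context line: present in both versions of the file
            old_no += 1
            new_no += 1
    return result
```

Deleted lines are numbered against the old version of the file and added lines against the new one, which is why two counters are kept.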
repo

GitPython object Repo.

Returns:Repo
reset() → None

Reset the state of the repo, checking out the main branch and discarding local changes (-f option).
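
Because checkout changes the state of the repo, one possible pattern is to pair it with reset so the repo is restored even if the analysis fails. The context manager below is a hypothetical helper, not part of PyDriller; it only assumes an object exposing checkout and reset as documented here.

```python
from contextlib import contextmanager


@contextmanager
def checked_out(git_repo, commit_hash: str):
    """Check out commit_hash, then always reset the repo afterwards."""
    git_repo.checkout(commit_hash)
    try:
        yield git_repo
    finally:
        # Runs on normal exit and when the body raises
        git_repo.reset()


# Usage (assumes PyDriller is installed and the path points to a real repo):
# from pydriller.git_repository import GitRepository
# with checked_out(GitRepository("path/to/repo"), "<commit hash>") as gr:
#     print(len(gr.files()))
```

As with checkout itself, this should not be used from more than 1 thread on the same repo.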

total_commits() → int

Calculate total number of commits.

Returns:the total number of commits

RepositoryMining

This module includes 1 class, RepositoryMining, the main class of PyDriller.

class pydriller.repository_mining.RepositoryMining(path_to_repo: Union[str, List[str]], single: str = None, since: datetime.datetime = None, to: datetime.datetime = None, from_commit: str = None, to_commit: str = None, from_tag: str = None, to_tag: str = None, reversed_order: bool = False, only_in_branch: str = None, only_modifications_with_file_types: List[str] = None, only_no_merge: bool = False, only_authors: List[str] = None, only_commits: List[str] = None, only_releases: bool = False, filepath: str = None, histogram_diff: bool = False, skip_whitespaces: bool = False)

This is the main class of PyDriller, responsible for running the study.

__init__(path_to_repo: Union[str, List[str]], single: str = None, since: datetime.datetime = None, to: datetime.datetime = None, from_commit: str = None, to_commit: str = None, from_tag: str = None, to_tag: str = None, reversed_order: bool = False, only_in_branch: str = None, only_modifications_with_file_types: List[str] = None, only_no_merge: bool = False, only_authors: List[str] = None, only_commits: List[str] = None, only_releases: bool = False, filepath: str = None, histogram_diff: bool = False, skip_whitespaces: bool = False)

Init a repository mining. The only required parameter is “path_to_repo”: to analyze a single repo, pass the absolute path to the repo; if you need to analyze more repos, pass a list of absolute paths.

Furthermore, PyDriller supports local and remote repositories: if you pass a path to a repo, PyDriller will run the study on that repo; if you pass a URL, PyDriller will clone the repo into a temporary folder, run the study, and delete the temporary folder.

Parameters:
  • path_to_repo (Union[str,List[str]]) – absolute path (or list of absolute paths) to the repository(ies) to analyze
  • single (str) – hash of a single commit to analyze
  • since (datetime) – starting date
  • to (datetime) – ending date
  • from_commit (str) – starting commit (only if since is None)
  • to_commit (str) – ending commit (only if to is None)
  • from_tag (str) – starting the analysis from the specified tag (only if since and from_commit are None)
  • to_tag (str) – ending the analysis at the specified tag (only if to and to_commit are None)
  • reversed_order (bool) – whether the commits should be analyzed in reversed order
  • only_in_branch (str) – only commits in this branch will be analyzed
  • only_modifications_with_file_types (List[str]) – only modifications with these file types will be analyzed
  • only_no_merge (bool) – if True, merges will not be analyzed
  • only_authors (List[str]) – only commits of these authors will be analyzed (the check is done on the username, NOT the email)
  • only_commits (List[str]) – only these commits will be analyzed
  • filepath (str) – only commits that modified this file will be analyzed
__module__ = 'pydriller.repository_mining'
traverse_commits() → Generator[[pydriller.domain.commit.Commit, None], None]

Analyze all the specified commits (all of them by default), returning a generator of commits.
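
A typical study builds a RepositoryMining with a few filters and iterates over traverse_commits. The sketch below shows one way this could look; mine_commits and commits_per_author are hypothetical helper names, and the import is deferred so the counting logic can be exercised without PyDriller installed.

```python
from collections import Counter
from typing import Dict, Iterable


def mine_commits(path_to_repo, **filters):
    """Return the commit generator for a repo, forwarding keyword filters."""
    # Deferred import: only needed when a real repository is mined
    from pydriller.repository_mining import RepositoryMining
    return RepositoryMining(path_to_repo, **filters).traverse_commits()


def commits_per_author(commits: Iterable) -> Dict[str, int]:
    """Count commits by author name (Commit.author is a Developer)."""
    counts: Counter = Counter()
    for commit in commits:
        counts[commit.author.name] += 1
    return dict(counts)


# Usage (assumes PyDriller is installed and the path points to a real repo):
# from datetime import datetime
# commits = mine_commits("path/to/repo",
#                        since=datetime(2019, 1, 1),
#                        only_no_merge=True)
# print(commits_per_author(commits))
```

Since traverse_commits returns a generator, commits are consumed lazily and can only be iterated once per call.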

Commit

This module contains all the classes regarding a specific commit, such as Commit, Modification, ModificationType and Method.

class pydriller.domain.commit.Commit(commit: git.objects.commit.Commit, conf)

Class representing a Commit. Contains all the important information such as hash, author, dates, and modified files.

__init__(commit: git.objects.commit.Commit, conf) → None

Create a commit object.

Parameters:
  • commit – GitPython Commit object
  • conf – Configuration class
__module__ = 'pydriller.domain.commit'
author

Return the author of the commit as a Developer object.

Returns:author
author_date

Return the authored datetime.

Returns:datetime author_datetime
author_timezone

Author timezone, expressed as an offset from UTC in seconds.

Returns:int timezone
branches

Return the set of branches that contain the commit.

Returns:set(str) branches
committer

Return the committer of the commit as a Developer object.

Returns:committer
committer_date

Return the committed datetime.

Returns:datetime committer_datetime
committer_timezone

Committer timezone, expressed as an offset from UTC in seconds.

Returns:int timezone
hash

Return the SHA of the commit.

Returns:str hash
in_main_branch

Return True if the commit is in the main branch, False otherwise.

Returns:bool in_main_branch
merge

Return True if the commit is a merge, False otherwise.

Returns:bool merge
modifications

Return a list of modified files.

Returns:List[Modification] modifications
msg

Return commit message.

Returns:str commit_message
parents

Return the list of parent SHAs.

Returns:List[str] parents
project_name

Return the project name.

Returns:project name
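
The properties above can be combined into a compact record per commit. The following is only a sketch: describe_commit is a hypothetical helper, and it reads nothing beyond the Commit attributes documented in this section.

```python
def describe_commit(commit) -> dict:
    """Summarize a Commit-like object using its documented properties."""
    return {
        "hash": commit.hash,
        "author": commit.author.name,
        "date": commit.author_date.isoformat(),
        # First line of the commit message, or "" for an empty message
        "subject": commit.msg.splitlines()[0] if commit.msg else "",
        "is_merge": commit.merge,
        "parents": len(commit.parents),
        "files": len(commit.modifications),
    }
```

A dictionary like this is easy to write to CSV or JSON when exporting the results of a study.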
class pydriller.domain.commit.Method(func)

This class represents a method in a class. Contains various information extracted through Lizard.

__init__(func)

Initialize a method object. This is calculated using Lizard: it parses the source code of all the modifications in a commit, extracting information about the methods contained in the file (if the file is source code written in one of the supported programming languages).

__module__ = 'pydriller.domain.commit'
class pydriller.domain.commit.Modification(old_path: str, new_path: str, change_type: pydriller.domain.commit.ModificationType, diff_and_sc: Dict[str, str])

This class contains information regarding a modified file in a commit.

__init__(old_path: str, new_path: str, change_type: pydriller.domain.commit.ModificationType, diff_and_sc: Dict[str, str])

Initialize a modification. A modification carries information regarding the changed file. Normally, you shouldn’t initialize a new one.

__module__ = 'pydriller.domain.commit'
added

Return the total number of added lines in the file.

Returns:int lines_added
complexity

Calculate the Cyclomatic Complexity of the file.

Returns:Cyclomatic Complexity of the file
filename

Return the filename. Given a path-like-string (e.g. “/Users/dspadini/pydriller/myfile.py”) returns only the filename (e.g. “myfile.py”)

Returns:str filename
methods

Return the list of methods in the file. Every method contains various information like complexity, loc, name, number of parameters, etc.

Returns:list of methods
new_path

New path of the file. Can be None if the file is deleted.

Returns:str new_path
nloc

Calculate the LOC of the file.

Returns:LOC of the file
old_path

Old path of the file. Can be None if the file is added.

Returns:str old_path
removed

Return the total number of deleted lines in the file.

Returns:int lines_deleted
token_count

Calculate the token count of functions.

Returns:token count
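
As an example of working with these properties, the sketch below computes the churn (added plus removed lines) per file for a list of modifications. churn_per_file is a hypothetical helper; it uses new_path when available and falls back to old_path, since new_path can be None for deleted files.

```python
from typing import Dict, Iterable


def churn_per_file(modifications: Iterable) -> Dict[str, int]:
    """Map each file path to its added + removed line count."""
    churn: Dict[str, int] = {}
    for mod in modifications:
        # new_path is None for deleted files, old_path is None for added ones
        path = mod.new_path or mod.old_path
        churn[path] = churn.get(path, 0) + mod.added + mod.removed
    return churn


# Usage (given a Commit object named `commit`):
# print(churn_per_file(commit.modifications))
```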
class pydriller.domain.commit.ModificationType

Type of Modification. Can be ADD, COPY, RENAME, DELETE, MODIFY or UNKNOWN.

ADD = 1
COPY = 2
DELETE = 4
MODIFY = 5
RENAME = 3
UNKNOWN = 6
__module__ = 'pydriller.domain.commit'

Developer

This module includes only 1 class, Developer, representing a developer.

class pydriller.domain.developer.Developer(name: str, email: str)

This class represents a developer. We save the email and the name.

__init__(name: str, email: str)

Class to identify a developer.

Parameters:
  • name (str) – name and surname of the developer
  • email (str) – email of the developer
__module__ = 'pydriller.domain.developer'

Process Metrics

Module that calculates process metrics. See https://ieeexplore.ieee.org/document/6606589 for more info.

class pydriller.metrics.process.process_metrics.ProcessMetrics

This class is responsible for implementing the following process metrics:

  • Commit Count: measures the number of commits made to a file
  • Distinct Developers Count: measures the cumulative number of distinct developers that contributed to a file
__module__ = 'pydriller.metrics.process.process_metrics'
commits_count(path_to_repo: str, filepath: str, from_commit: str = None, to_commit: str = None)

Return the number of commits made to a file from the first commit to the one identified by commit_hash.

Parameters:
  • path_to_repo – path to a single repo
  • commit_hash – the SHA of the commit to stop counting. If None, the analysis starts from the latest commit
  • filepath – the path to the file to count for. E.g. ‘doc/README.md’
Returns:int number of commits made to the file
distinct_dev_count(path_to_repo: str, filepath: str, from_commit: str = None, to_commit: str = None)

Return the cumulative number of distinct developers that contributed to the file up to the indicated commit.

Parameters:
  • path_to_repo – path to a single repo
  • commit_hash – the SHA of the commit to stop counting. If None, the SHA is the latest commit SHA
  • filepath – the path to the file to count for. E.g. ‘doc/README.md’
Returns:int number of distinct developers contributing to the file
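
To clarify what the two metrics measure, here is a simplified standalone sketch computed over an iterable of commits. It is not the ProcessMetrics implementation: file_process_metrics is a hypothetical name, commit ranges are not handled, renames are not followed, and developers are distinguished by email as an assumption.

```python
from typing import Iterable, Tuple


def file_process_metrics(commits: Iterable, filepath: str) -> Tuple[int, int]:
    """Return (commit count, distinct developer count) for one file."""
    commit_count = 0
    developers = set()
    for commit in commits:
        for mod in commit.modifications:
            # A file is touched if it matches the old or the new path
            if filepath in (mod.old_path, mod.new_path):
                commit_count += 1
                developers.add(commit.author.email)
                break  # count each commit at most once
    return commit_count, len(developers)
```

The paths are compared as plain strings, so filepath must use the same separators and relative form as the paths reported in the modifications.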