KGTorrent package

kgtorrent

KGTorrent.kgtorrent.main()

Entry-point function for KGTorrent. It orchestrates function/method calls to build and populate the KGTorrent database and dataset.

config

This is the configuration file of KGTorrent.

Here the main variables of the program are set, mostly by reading their values from environment variables.

See Configuration for details on the environment variables that must be set to run KGTorrent.
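As a rough illustration of this pattern, the sketch below collects database settings from environment variables. The variable names (EXAMPLE_DB_USERNAME and so on), the load_db_settings helper and the default host and port are hypothetical placeholders, not the names KGTorrent actually uses; the Configuration page remains the reference for the real ones.

```python
import os


def load_db_settings(environ=os.environ):
    """Collect database settings from environment variables.

    All variable names below are hypothetical placeholders.
    """
    required = ["EXAMPLE_DB_USERNAME", "EXAMPLE_DB_PASSWORD", "EXAMPLE_DB_NAME"]
    missing = [name for name in required if name not in environ]
    if missing:
        # Fail early with a clear message rather than a KeyError later on.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    return {
        "db_username": environ["EXAMPLE_DB_USERNAME"],
        "db_password": environ["EXAMPLE_DB_PASSWORD"],
        "db_name": environ["EXAMPLE_DB_NAME"],
        # Optional settings fall back to assumed defaults.
        "db_host": environ.get("EXAMPLE_DB_HOST", "localhost"),
        "db_port": int(environ.get("EXAMPLE_DB_PORT", "3306")),
    }


# Passing a plain dict instead of os.environ keeps the example reproducible.
settings = load_db_settings(
    {
        "EXAMPLE_DB_USERNAME": "user",
        "EXAMPLE_DB_PASSWORD": "secret",
        "EXAMPLE_DB_NAME": "example_db",
    }
)
```

Checking for missing variables up front means a misconfigured environment is reported once, before any database work starts.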

db_communication_handler

This module defines the class that handles the communication with the database via SQLAlchemy.

class KGTorrent.db_communication_handler.DbCommunicationHandler(db_username, db_password, db_host, db_port, db_name)

Bases: object

This class creates an SQLAlchemy engine and has methods for creating, populating and querying a database with MetaKaggle data.

create_new_db(drop_if_exists=False)

This method creates a database with the provided name and builds the schemas of the MetaKaggle tables. By default, it raises an exception if the database already exists, to guard against accidental re-initialization.

Parameters

drop_if_exists – If True the database is dropped before creation. By default it is False.
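The guard described above can be sketched with a stand-in that uses a local SQLite file instead of the real database server. This is only an assumption-laden illustration: the free function create_new_db, its db_path argument, the example table and the locally defined DatabaseExistsError are all hypothetical and do not reproduce the actual method.

```python
import os
import sqlite3
import tempfile


class DatabaseExistsError(Exception):
    """Local stand-in for KGTorrent.exceptions.DatabaseExistsError."""


def create_new_db(db_path, drop_if_exists=False):
    """Create a SQLite database file, refusing to overwrite one unless asked to."""
    if os.path.exists(db_path):
        if not drop_if_exists:
            raise DatabaseExistsError(f"Database '{db_path}' already exists.")
        os.remove(db_path)  # explicit opt-in: drop the old database first
    conn = sqlite3.connect(db_path)
    # Placeholder schema; the real method builds the MetaKaggle tables.
    conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY)")
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.db")
    create_new_db(path)                        # first call succeeds
    try:
        create_new_db(path)                    # second call is refused
        refused = False
    except DatabaseExistsError:
        refused = True
    create_new_db(path, drop_if_exists=True)   # forced re-creation succeeds
    recreated = os.path.exists(path)
```

Making the destructive path opt-in means a repeated call cannot silently wipe an existing database.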

db_exists()

This method checks whether the database exists.

Returns

True if the database exists, False otherwise.

Return type

bool

get_nb_identifiers(languages)

This method queries the database in order to retrieve slugs and identifiers of notebooks written in the provided languages.

Parameters

languages – A list of strings naming the notebook languages available on Kaggle.

Returns

A pandas.DataFrame (nb_identifiers) containing notebook slugs and identifiers.

Return type

pandas.DataFrame
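A language-filtered query of this kind might look like the sketch below. It uses the standard-library sqlite3 module in place of SQLAlchemy and returns plain tuples rather than a pandas.DataFrame. The notebooks table, its columns and the sample rows are hypothetical and do not reflect the MetaKaggle schema.

```python
import sqlite3


def get_nb_identifiers(conn, languages):
    """Return (slug, identifier) rows for notebooks written in `languages`."""
    if not languages:
        return []
    # One placeholder per language keeps the query fully parameterized.
    placeholders = ", ".join("?" for _ in languages)
    query = (
        "SELECT slug, identifier FROM notebooks "
        f"WHERE language IN ({placeholders}) ORDER BY identifier"
    )
    return conn.execute(query, list(languages)).fetchall()


# Hypothetical table and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notebooks (identifier INTEGER, slug TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO notebooks VALUES (?, ?, ?)",
    [(1, "alice/eda", "Python"), (2, "bob/plots", "R"), (3, "carol/model", "Python")],
)
rows = get_nb_identifiers(conn, ["Python"])
```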

set_foreign_keys(constraints_df)

This method sets the foreign key constraints described in the provided pandas.DataFrame.

Parameters

constraints_df – The pandas.DataFrame which contains the foreign key constraints information.
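One way to turn tabular constraint information into SQL is sketched below. A list of dicts stands in for the rows of the pandas.DataFrame. The build_fk_statements helper, the record keys (table, column, ref_table, ref_column) and the sample table names are hypothetical; the actual layout of constraints_df is not documented here.

```python
def build_fk_statements(constraints):
    """Turn constraint records into ALTER TABLE statements.

    Each record is a dict with the hypothetical keys: table, column,
    ref_table and ref_column. Identifiers are interpolated directly,
    so this is only suitable for trusted input.
    """
    statements = []
    for c in constraints:
        statements.append(
            f"ALTER TABLE {c['table']} "
            f"ADD FOREIGN KEY ({c['column']}) "
            f"REFERENCES {c['ref_table']}({c['ref_column']});"
        )
    return statements


# Hypothetical constraint: notebooks.author_id references users.id.
statements = build_fk_statements(
    [{"table": "notebooks", "column": "author_id", "ref_table": "users", "ref_column": "id"}]
)
```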

write_tables(tables_dict)

This method writes tables to the database by using the pandas.DataFrame.to_sql method.

Parameters

tables_dict – The dictionary whose keys are the table names and whose values are the pandas.DataFrame tables.

downloader

This module defines the class that handles the actual download of Jupyter notebooks from Kaggle.

class KGTorrent.downloader.Downloader(nb_identifiers, nb_archive_path)

Bases: object

The Downloader class handles the download of Jupyter notebooks from Kaggle. It requires a pandas.DataFrame of notebook slugs and identifiers in order to request the notebooks, and it uses one of the following two strategies:

HTTP

to download full notebooks via HTTP requests;

API

to download notebooks via calls to the official Kaggle API; Jupyter notebooks downloaded with this strategy always lack the output of code cells.

Notebooks that are already present in the download folder are skipped. During the refresh procedure, notebooks that are present in the download folder but no longer referenced in the KGTorrent database are deleted.
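The skip-and-refresh behaviour can be sketched as a pair of set differences between the notebooks the database references and the files already on disk. The helper names and the convention of one <name>.ipynb file per notebook are assumptions made for illustration, not the Downloader's actual implementation.

```python
import tempfile
from pathlib import Path


def plan_downloads(referenced_names, archive_dir):
    """Split the work into notebooks to fetch and stale files to delete.

    `referenced_names` are file names (without extension) still referenced
    by the database; this naming scheme is an assumption.
    """
    present = {p.stem for p in Path(archive_dir).glob("*.ipynb")}
    wanted = set(referenced_names)
    to_download = sorted(wanted - present)  # notebooks already on disk are skipped
    to_delete = sorted(present - wanted)    # no longer referenced: removed on refresh
    return to_download, to_delete


def refresh(referenced_names, archive_dir):
    """Delete stale notebooks and return the names still to be downloaded."""
    to_download, to_delete = plan_downloads(referenced_names, archive_dir)
    for name in to_delete:
        (Path(archive_dir) / f"{name}.ipynb").unlink()
    return to_download


with tempfile.TemporaryDirectory() as tmp:
    for name in ("kept", "stale"):
        (Path(tmp) / f"{name}.ipynb").write_text("{}")
    pending = refresh(["kept", "new"], tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Here "kept" is skipped because it is already on disk, "stale" is deleted because nothing references it any more, and only "new" is left to download.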

download_notebooks(strategy='HTTP')

This method executes the download procedure using the provided strategy after checking the destination folder.

Parameters

strategy – The download strategy (HTTP or API). By default it is HTTP.

exceptions

This module defines exceptions that might be raised while building or updating KGTorrent with this package.

exception KGTorrent.exceptions.DatabaseExistsError(message)

Bases: KGTorrent.exceptions.Error

Exception raised when a database with the given name already exists and its removal has not been forced.

message

Short message explaining the error.

Type

str

exception KGTorrent.exceptions.Error

Bases: Exception

Base class for exceptions in this package.

exception KGTorrent.exceptions.TableNotPreprocessedError(message)

Bases: KGTorrent.exceptions.Error

Exception raised when the table that was just read was expected to be already preprocessed (and serialized to a .bz2 pickle file) but is not.

message

Short message explaining the error.

Type

str
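The hierarchy described in this module can be mirrored in a few lines. The classes below are local re-creations written for illustration, not the package's source, and the describe helper is hypothetical. The sketch shows how a shared base class lets callers handle every package-specific error in a single except clause.

```python
class Error(Exception):
    """Base class for the exceptions in this sketch."""


class DatabaseExistsError(Error):
    def __init__(self, message):
        super().__init__(message)
        self.message = message  # short explanation of the error


class TableNotPreprocessedError(Error):
    def __init__(self, message):
        super().__init__(message)
        self.message = message  # short explanation of the error


def describe(exc):
    """Raise `exc` and handle it through the shared base class."""
    try:
        raise exc
    except Error as err:
        return f"{type(err).__name__}: {err}"


summary = describe(DatabaseExistsError("database already exists"))
```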