KGTorrent package
kgtorrent
- KGTorrent.kgtorrent.main()
Entry-point function for KGTorrent. It orchestrates function/method calls to build and populate the KGTorrent database and dataset.
config
This is the configuration file of KGTorrent.
Here the main variables of the program are set, mostly by reading their values from environment variables.
See Configuration for details on the environment variables that must be set to run KGTorrent.
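As an illustration of the pattern described above, the following minimal sketch reads settings from environment variables using only the standard library. The variable names (`EXAMPLE_DB_HOST`, `EXAMPLE_DB_NAME`, `EXAMPLE_DB_PORT`), the default port, and the `load_settings` helper are hypothetical placeholders, not the names KGTorrent actually uses; the Configuration page remains the reference for the real variables.

```python
import os


def load_settings(environ=os.environ):
    """Read example settings from environment variables.

    All variable names below are hypothetical placeholders,
    not the variables KGTorrent actually reads.
    """
    required = ["EXAMPLE_DB_HOST", "EXAMPLE_DB_NAME"]
    missing = [name for name in required if name not in environ]
    if missing:
        # Fail early with a message that lists every missing variable.
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "db_host": environ["EXAMPLE_DB_HOST"],
        "db_name": environ["EXAMPLE_DB_NAME"],
        # Optional variable; the default is arbitrary and for illustration only.
        "db_port": int(environ.get("EXAMPLE_DB_PORT", "3306")),
    }
```

Passing the mapping as a parameter keeps the helper easy to exercise with a plain dictionary instead of the real process environment.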
db_communication_handler
This module defines the class that handles the communication with the database via SQLAlchemy.
- class KGTorrent.db_communication_handler.DbCommunicationHandler(db_username, db_password, db_host, db_port, db_name)
Bases: object
This class creates an SQLAlchemy engine and has methods for creating, populating and querying a database with MetaKaggle data.
- create_new_db(drop_if_exists=False)
This method creates a database with the provided name and builds the schemas of the MetaKaggle tables. By default, it raises an exception if the database already exists, to guard against accidental re-initialization.
- Parameters
drop_if_exists – If True, the database is dropped before creation. Defaults to False.
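The guard described above can be sketched with a stand-in class that tracks database names in a set rather than talking to a real server through SQLAlchemy. `InMemoryHandler` is hypothetical, and the assumption that `create_new_db` raises `DatabaseExistsError` specifically (rather than some other exception) is inferred from the exceptions module and is not confirmed by this page.

```python
class DatabaseExistsError(Exception):
    """Stand-in for the exception of the same name in KGTorrent.exceptions."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


class InMemoryHandler:
    """Hypothetical stand-in: a set of names plays the role of the database server."""

    def __init__(self, db_name, existing=None):
        self.db_name = db_name
        self.existing = set(existing or [])

    def db_exists(self):
        return self.db_name in self.existing

    def create_new_db(self, drop_if_exists=False):
        if self.db_exists():
            if not drop_if_exists:
                # Refuse to overwrite unless the caller asked for it explicitly.
                raise DatabaseExistsError(f"Database '{self.db_name}' already exists.")
            self.existing.discard(self.db_name)  # "drop" the old database
        self.existing.add(self.db_name)  # "create" the new one
```

With this shape, an accidental second initialization fails loudly, while `create_new_db(drop_if_exists=True)` makes the destructive intent explicit at the call site.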
- db_exists()
This method checks whether the database exists.
- Returns
True if the database exists, False otherwise.
- Return type
bool
- get_nb_identifiers(languages)
This method queries the database to retrieve the slugs and identifiers of notebooks written in the provided languages.
- Parameters
languages – An array of strings naming notebook languages present on Kaggle.
- Returns
nb_identifiers, the pandas.DataFrame containing notebook slugs and identifiers.
- Return type
pandas.DataFrame
- set_foreign_keys(constraints_df)
This method sets the foreign key constraints based on information provided by the related pandas.DataFrame.
- Parameters
constraints_df – The pandas.DataFrame which contains the foreign key constraints information.
- write_tables(tables_dict)
This method writes tables to the database by using the pandas.DataFrame.to_sql method.
- Parameters
tables_dict – The dictionary whose keys are the table names and whose values are the pandas.DataFrame tables.
downloader
This module defines the class that handles the actual download of Jupyter notebooks from Kaggle.
- class KGTorrent.downloader.Downloader(nb_identifiers, nb_archive_path)
Bases: object
The Downloader class handles the download of Jupyter notebooks from Kaggle. It needs the pandas.DataFrame of notebook slugs and identifiers in order to request notebooks from Kaggle. To do so, it uses one of the following two strategies:
- HTTP to download full notebooks via HTTP requests;
- API to download notebooks via calls to the official Kaggle API; Jupyter notebooks downloaded with this strategy always lack the output of code cells.
Notebooks that are already present in the download folder are skipped. During the refresh procedure, all notebooks that are present in the download folder but are no longer referenced in the KGTorrent database are deleted.
- download_notebooks(strategy='HTTP')
This method executes the download procedure using the provided strategy after checking the destination folder.
- Parameters
strategy – The download strategy (HTTP or API). By default it is HTTP.
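The skip and refresh rules described for the Downloader amount to two set differences between the notebooks on disk and the notebooks the database references. The sketch below shows that bookkeeping only, with no network access; `plan_download`, `refresh`, and the `<name>.ipynb` file-naming scheme are assumptions made for illustration, not the package's actual implementation.

```python
from pathlib import Path


def plan_download(archive_dir, referenced_names):
    """Return (to_download, to_delete) for a folder of .ipynb files.

    `referenced_names` are the notebook file stems still referenced by the
    database. The `<name>.ipynb` naming scheme is an assumption of this sketch.
    """
    present = {p.stem for p in Path(archive_dir).glob("*.ipynb")}
    referenced = set(referenced_names)
    to_download = sorted(referenced - present)  # notebooks already on disk are skipped
    to_delete = sorted(present - referenced)    # stale notebooks, removed on refresh
    return to_download, to_delete


def refresh(archive_dir, referenced_names):
    """Delete stale notebooks and return the names that still need downloading."""
    to_download, to_delete = plan_download(archive_dir, referenced_names)
    for stem in to_delete:
        (Path(archive_dir) / f"{stem}.ipynb").unlink()
    return to_download
```

Separating the planning step from the deletion makes it possible to inspect what a refresh would remove before anything is touched on disk.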
exceptions
This module defines exceptions that might be raised while building or updating KGTorrent using this package.
- exception KGTorrent.exceptions.DatabaseExistsError(message)
Bases: KGTorrent.exceptions.Error
Exception raised when a database with the given name already exists and its removal has not been forced.
- message
Short message explaining the error.
- Type
str
- exception KGTorrent.exceptions.Error
Bases: Exception
Base class for exceptions in this package.
- exception KGTorrent.exceptions.TableNotPreprocessedError(message)
Bases: KGTorrent.exceptions.Error
Exception raised when the table that was just read was expected to be already preprocessed (and serialized to a .bz2 pickle file) but is not.
- message
Short message explaining the error.
- Type
str
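Because both concrete exceptions derive from a common base class, calling code can handle any package error with a single `except` clause. The sketch below re-declares the hierarchy locally so that it runs without KGTorrent installed; the class bodies are assumptions that mirror only the documented names, bases, and `message` attribute, and the `describe` helper is hypothetical.

```python
class Error(Exception):
    """Local stand-in for the package's base exception class."""


class DatabaseExistsError(Error):
    def __init__(self, message):
        super().__init__(message)
        self.message = message  # short explanation of the error


class TableNotPreprocessedError(Error):
    def __init__(self, message):
        super().__init__(message)
        self.message = message  # short explanation of the error


def describe(exc):
    """Raise `exc` and report it through a single handler for the base class."""
    try:
        raise exc
    except Error as err:
        # Fall back to str(err) for subclasses that do not define `message`.
        return f"{type(err).__name__}: {getattr(err, 'message', str(err))}"
```

Catching the base class keeps error handling in one place, while the concrete type remains available when a caller needs to react differently to an existing database than to a missing preprocessed table.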