Using the collection ==================== Users interested in analyzing the KGTorrent database should download it from its Zenodo repository; it is stored as a compressed archive named ``KGT_dataset.tar.bz2``. The dataset can be analyzed as a whole, although we believe that the most interesting use case is to leverage the companion database to select a subset of notebooks based on specific research criteria. To this aim, along with the dataset, one should download the compressed archive containing the dump of the MySQL database (named ``KGTorrent_dump_10-2020.sql.tar.bz2``), uncompress it and import it into a local MySQL installation. You can use the Linux ``tar`` command to uncompress both archives:: tar -xf KGT_dataset.tar.bz2 -C /path/to/local/dataset/folder tar -xf KGTorrent_dump_10-2020.sql.tar.bz2 Then, import the MySQL dump in your local MySQL installation. To perform this step, you can follow `this guide `_. Once the database has been correctly imported, you can query it to select a subset of notebooks based on specific criteria. The Jupyter notebooks in KGTorrent are saved with filenames following this pattern: ``UserName_CurrentUrlSlug``, where ``UserName`` is a field of the ``Users`` table, while ``CurrentUrlSlug`` is a field of the ``Kernels`` table. Therefore, by including such pattern in the ``SELECT`` statement of an SQL query, the result will comprise a column listing the names of the selected Jupyter notebooks from the dataset. In the following example, I select the filenames of all Python Jupyter notebooks that have been awarded a gold medal in Kaggle: .. code-block:: mysql SELECT CONCAT(u.UserName, '_', k.CurrentUrlSlug, '.ipynb') FilteredNotebookNames FROM ((kernels k JOIN users u ON k.AuthorUserId = u.Id) JOIN kernelversions kv ON k.CurrentKernelVersionId = kv.Id) JOIN kernellanguages kl ON kv.ScriptLanguageId = kl.Id WHERE kl.name LIKE 'IPython Notebook HTML' AND k.Medal = 1; Note that ``IPython Notebook HTML`` is the language name used in Meta Kaggle to identify Jupyter notebooks written in Python. In MySQL, if you want to save the results of the query above to a ``.csv`` file, you can extend it as follows: .. code-block:: mysql SELECT CONCAT(u.UserName, '_', k.CurrentUrlSlug, '.ipynb') FilteredNotebookNames FROM ((kernels k JOIN users u ON k.AuthorUserId = u.Id) JOIN kernelversions kv ON k.CurrentKernelVersionId = kv.Id) JOIN kernellanguages kl ON kv.ScriptLanguageId = kl.Id WHERE kl.name LIKE 'IPython Notebook HTML' AND k.Medal = 1 INTO OUTFILE '/var/lib/mysql-files/query_result.csv' FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'; In the GitHub repository (at ``docs/imgs/KGTorrent_logical_schema.png``) as well as in the Zenodo repository containing the dataset, we share the logical schema underlying the KGTorrent database. We built this schema by reverse engineering a relationa model from Meta Kaggle data. .. image:: ../imgs/KGTorrent_logical_schema.png :width: 1200 :alt: KGTorrent logical schema All the most relevant relationships among the db tables are explicitly represented in the schema. However, we decided to omit some of them to ensure a good readability of the image.