
Datasets

The dataset module provides an easy way to load and preprocess datasets. The package comes with a few datasets that are commonly used in topic modeling research. The datasets are:

  • 20NewsGroup

  • BBC_News

  • Stocktwits_GME

  • Reddit_GME

  • Reuters

  • Spotify

  • Spotify_most_popular

  • Poliblogs

  • Spotify_least_popular

Please see the functionality available in the TMDataset module.

Note: Make sure the required NLTK resources are installed. If they are not, run the following commands:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
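The downloads above can be made conditional so that repeated notebook runs do not fetch resources that are already present. The following is a minimal sketch, not part of stream_topic: the helper name `ensure_nltk_resources` and the `find`/`download` parameters are hypothetical, and the resource paths are the locations these packages normally occupy in the NLTK data directory (an assumption worth checking against your NLTK version).

```python
# Usual NLTK data locations for the resources listed above (assumed, not
# taken from the stream_topic documentation).
NLTK_RESOURCES = {
    "stopwords": "corpora/stopwords",
    "punkt": "tokenizers/punkt",
    "wordnet": "corpora/wordnet",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
}


def ensure_nltk_resources(resources=NLTK_RESOURCES, find=None, download=None):
    """Download only the resources that cannot be found locally.

    `find` and `download` default to nltk.data.find and nltk.download;
    they are parameters so the logic can be exercised without network access.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested offline

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for name, path in resources.items():
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched
```

Calling `ensure_nltk_resources()` with no arguments returns the list of resource names it had to download. If a lookup path does not match your installation, the only effect is that `nltk.download` is called again for that resource.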
# Uncomment the line below if running in Colab;
# the package needs to be installed for the notebook to run.

# ! pip install -U stream_topic
import warnings
warnings.filterwarnings("ignore")
from stream_topic.utils import TMDataset

Using default datasets

  • These datasets are already preprocessed and ready to be used for topic modeling.

  • They are provided with the package and can be loaded using the TMDataset module.

dataset = TMDataset()
dataset.fetch_dataset(name="Reuters")
2024-08-09 15:32:39.680 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: Reuters
2024-08-09 15:32:40.002 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:32:40.363 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:32:40.757 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:32:40.970 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/
dataset.get_bow()
(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 array(['00', '000', '001', ..., 'zurich', 'zverev', 'zzzz'], dtype=object))
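The output above is a pair: a document-by-term count matrix and the vocabulary that labels its columns. To make that layout concrete, here is a small pure-Python sketch of a bag-of-words construction. It is an illustration only and is not how stream_topic computes the matrix; the function name `bag_of_words` and the toy documents are hypothetical.

```python
from collections import Counter


def bag_of_words(tokenized_docs):
    """Return (matrix, vocabulary): one row per document, one column per term."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    column = {tok: i for i, tok in enumerate(vocabulary)}

    matrix = []
    for doc in tokenized_docs:
        row = [0.0] * len(vocabulary)
        for tok, count in Counter(doc).items():
            row[column[tok]] = float(count)  # raw term count
        matrix.append(row)
    return matrix, vocabulary


# Toy documents, already tokenized
docs = [["oil", "price", "oil"], ["bank", "price"]]
matrix, vocabulary = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Most entries are zero for any realistic vocabulary, which is why the printed matrix above shows only zeros in its visible corners.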
dataset.get_tfidf()
(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array(['00', '000', '001', ..., 'zurich', 'zverev', 'zzzz'], dtype=object))
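The TF-IDF output has the same (matrix, vocabulary) layout, with counts replaced by weights. The sketch below uses one common textbook variant, term frequency times log(N / document frequency). The exact weighting and normalization used by stream_topic are not stated on this page, so treat this as an assumption for illustration; the function name `tfidf` and the toy documents are hypothetical.

```python
import math


def tfidf(tokenized_docs):
    """Return (matrix, vocabulary) using tf * log(N / df) weighting.

    Assumes every document contains at least one token.
    """
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = {tok: sum(tok in doc for doc in tokenized_docs) for tok in vocabulary}

    matrix = []
    for doc in tokenized_docs:
        row = []
        for tok in vocabulary:
            tf = doc.count(tok) / len(doc)
            idf = math.log(n_docs / doc_freq[tok])
            row.append(tf * idf)
        matrix.append(row)
    return matrix, vocabulary


docs = [["oil", "price", "oil"], ["bank", "price"]]
weights, vocabulary = tfidf(docs)
print(vocabulary)
print(weights)
```

In this variant a term that occurs in every document (here "price") receives weight zero, since it does not help distinguish documents.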
# dataset.get_word_embeddings()
dataset.fetch_dataset('Spotify')
2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:108 - Dataset name already provided while instantiating the class: Reuters
2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:111 - Overwriting the dataset name with the name provided in fetch_dataset: Spotify
2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:115 - Fetching dataset: Spotify
2024-08-09 15:32:42.490 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:32:43.475 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:32:43.813 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:32:43.977 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/
dataset.dataframe.head()
 | name | duration_ms | explicit | artists | release_date | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | text | labels | tokens
0 | What They Want | 165853 | 1 | ['Russ'] | 2017-05-05 | 0.710 | 0.404 | 1 | -10.040 | 0 | 0.3790 | 0.48400 | 0.000000 | 0.0953 | 0.398 | 139.553 | 4 | yeah ooh yeah they let the rap game yeah yeah ... | 75 | [yeah, ooh, yeah, they, let, the, rap, game, y...
1 | Shores | 281367 | 0 | ['Seinabo Sey', 'Vargas & Lagola'] | 2019-09-20 | 0.431 | 0.491 | 5 | -6.615 | 1 | 0.0288 | 0.32200 | 0.000000 | 0.0679 | 0.275 | 143.879 | 4 | seinabo sey have always wondered your cause wh... | 58 | [seinabo, sey, have, always, wondered, your, c...
2 | The Prayer | 255360 | 0 | ['Anthony Callea'] | 2005 | 0.217 | 0.460 | 10 | -5.133 | 1 | 0.0302 | 0.76800 | 0.000008 | 0.0847 | 0.109 | 138.822 | 4 | youll our eyes and watch where and when dont k... | 37 | [youll, our, eyes, and, watch, where, and, whe...
3 | Send Me the Pillow You Dream On | 147440 | 0 | ['Hank Locklin'] | 2003-03-03 | 0.595 | 0.308 | 3 | -11.626 | 1 | 0.0333 | 0.84000 | 0.000004 | 0.0942 | 0.624 | 119.755 | 4 | send the pillow that you dream dont you know t... | 45 | [send, the, pillow, that, you, dream, dont, yo...
4 | It's a Rainy Day | 255400 | 0 | ['Ice Mc'] | 2008-03-16 | 0.619 | 0.736 | 2 | -11.686 | 0 | 0.0302 | 0.00482 | 0.001050 | 0.3350 | 0.484 | 134.955 | 4 | alexia and ice you the came down you were life... | 41 | [alexia, and, ice, you, the, came, down, you, ...
dataset.texts[:2]
['yeah ooh yeah they let the rap game yeah yeah yeah yeah they let this rap game yeah yeah got chick call her she feel like the like and some and feel like she feel like but she aint the only one got chick call her she she and off now she just got the the her they but they aint the only what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh this now they call they yeah off probably the only one yeah when you you all the like got the and and some probably the only one yeah what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh who ill you who fuck just all the what all the fuck they like the but know what they want aint its and but pop pop the let the boss when boss ill what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh',
 'seinabo sey have always wondered your cause when you you can not and ive been you your just see one seinabo sey will you ever tell what really your mind cause the greatest love youll ever find seinabo sey moving life better shores you see shores you see sure moving somebody thats sure sure sure see moving moving have always wondered you your like every word the ive always been and every word but the seinabo sey vargas lagola will you ever tell what really your mind cause the greatest love youll ever find you know seinabo sey vargas lagola seinabo sey moving life better shores you see shores you see sure moving somebody thats sure sure sure see moving cause not seinabo sey vargas lagola seinabo sey and you cant hold heart heart heart heart and you cant hold heart heart heart heart you cant hold heart heart hold heart and you cant hold heart heart heart and you cant hold heart heart you cant you cant you cant you cant hold heart heart heart heart heart moving moving ohohooh ohohooh ohohooh']
dataset.tokens
dataset.labels[:2]
[75, 58]
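The `texts`, `tokens`, and `labels` attributes shown above appear to be parallel sequences with one entry per document. A quick sanity check and label summary might look like the sketch below. The lists are stand-in values, not real dataset contents, and splitting on whitespace to get tokens is only an assumption made for this toy data.

```python
from collections import Counter

# Stand-in values shaped like dataset.texts, dataset.tokens, dataset.labels
texts = ["yeah ooh yeah", "moving life better shores", "send the pillow"]
tokens = [text.split() for text in texts]  # assumption: whitespace tokens
labels = [75, 58, 75]  # hypothetical labels for illustration

# Each document should have exactly one token list and one label
if not (len(texts) == len(tokens) == len(labels)):
    raise ValueError("texts, tokens, and labels must have the same length")

# How many documents carry each label
label_counts = Counter(labels)
print(label_counts.most_common())

# Average number of tokens per document
mean_length = sum(len(doc) for doc in tokens) / len(tokens)
print(mean_length)
```

With a loaded dataset, the three stand-in lists would be replaced by `dataset.texts`, `dataset.tokens`, and `dataset.labels`.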

Loading your own dataset

import pandas as pd
import numpy as np


# Simulating some example data
np.random.seed(0)

# Generate 1000 random strings of lengths between 1 and 5, containing letters 'A' to 'Z'
random_documents = [''.join(np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 
                                             np.random.randint(1, 6))) for _ in range(1000)]

# Generate 1000 random labels from 1 to 4 as strings
random_labels = np.random.choice(['1', '2', '3', '4'], 1000)

# Create DataFrame
my_data = pd.DataFrame({"Documents": random_documents, "Labels": random_labels})
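Before handing your own DataFrame to the dataset loader, it can help to confirm that the document and label columns exist and that no documents are empty. The helper below is a hypothetical pre-flight check written for this example; it is not part of stream_topic, and the library may perform its own validation.

```python
import pandas as pd


def count_empty_documents(frame, doc_column, label_column=None):
    """Check that the required columns exist and count blank documents."""
    required = [doc_column] + ([label_column] if label_column else [])
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing column(s): {missing}")

    docs = frame[doc_column]
    # Treat missing values and whitespace-only strings as empty documents
    is_empty = docs.fillna("").astype(str).str.strip().eq("")
    return int(is_empty.sum())


# Toy frame with one blank and one missing document
example = pd.DataFrame(
    {"Documents": ["first text", "  ", None], "Labels": ["1", "2", "3"]}
)
n_empty = count_empty_documents(example, "Documents", "Labels")
print(n_empty)
```

The column names passed here correspond to the `doc_column` and `label_column` arguments used when creating the dataset.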
dataset = TMDataset()
dataset.create_load_save_dataset(
    data=my_data, 
    dataset_name="sample_data",
    save_dir="data/",
    doc_column="Documents",
    label_column="Labels"
    )
Preprocessing documents: 100%|██████████| 1000/1000 [00:03<00:00, 251.82it/s]
2024-08-09 15:32:48.092 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:237 - Dataset saved to data/sample_data.parquet
2024-08-09 15:32:48.093 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:252 - Dataset info saved to data/sample_data_info.pkl
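The log lines above show two files being written: a Parquet file holding the data and a pickle file holding the dataset info. A short sketch for checking that both exist is given below. The naming pattern is inferred from these log messages only, and the helper name `expected_dataset_files` is hypothetical.

```python
from pathlib import Path


def expected_dataset_files(save_dir, dataset_name):
    """Build the two file paths suggested by the log output above."""
    base = Path(save_dir)
    return base / f"{dataset_name}.parquet", base / f"{dataset_name}_info.pkl"


parquet_path, info_path = expected_dataset_files("data/", "sample_data")
for path in (parquet_path, info_path):
    status = "found" if path.exists() else "missing"
    print(f"{path.as_posix()}: {status}")
```

If either file is reported as missing, fetching the dataset from the local path is likely to fail, so this check is a cheap first step when debugging.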
# The new dataset is saved in the data/ folder, unlike the default datasets, which are saved in the package directory under the preprocessed_data folder.
# You therefore need to provide the path to the data folder when fetching the dataset.
dataset.fetch_dataset(name="sample_data", dataset_path="data/", source="local")
2024-08-09 15:32:48.097 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: sample_data
2024-08-09 15:32:48.098 | INFO     | stream_topic.utils.dataset:fetch_dataset:128 - Fetching dataset from local path
dataset.dataframe.head()
 | text | labels | tokens
0 | PVADD | 2 | [PVADD]
1 | TV | 4 | [TV]
2 | EXG | 4 | [EXG]
3 | Y | 4 | [Y]
4 | BGHXO | 3 | [BGHXO]