Datasets


The dataset module provides an easy way to load and preprocess datasets. The package ships with several datasets that are commonly used in topic modeling research:

20NewsGroup
BBC_News
Stocktwits_GME
Reddit_GME
Reuters
Spotify
Spotify_most_popular
Poliblogs
Spotify_least_popular

See the functionality available in the TMDataset module.

Note: Make sure the NLTK resources are installed. If they are not, run the following code:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Uncomment the line below if running in Colab;
# the package needs to be installed for the notebook to run.

# ! pip install -U stream_topic

import warnings
warnings.filterwarnings("ignore")

from stream_topic.utils import TMDataset

Using default datasets

These datasets are already preprocessed and ready to be used for topic modeling.
They are provided with the package and can be loaded using the TMDataset module.

dataset = TMDataset()
dataset.fetch_dataset(name="Reuters")

2024-08-09 15:32:39.680 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: Reuters
2024-08-09 15:32:40.002 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:32:40.363 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:32:40.757 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:32:40.970 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/

dataset.get_bow()

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 array(['00', '000', '001', ..., 'zurich', 'zverev', 'zzzz'], dtype=object))
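
The output above appears to be a pair: a document-term count matrix and the vocabulary array that labels its columns. The following is a minimal sketch of that structure in plain Python. It is not the library's implementation; the helper name build_bow and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def build_bow(tokenized_docs):
    """Return (matrix, vocabulary); matrix[i][j] counts vocabulary[j] in document i.

    Hypothetical helper for illustration only, not part of stream_topic.
    """
    # Sorted vocabulary, so column order is deterministic.
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    column = {term: j for j, term in enumerate(vocabulary)}

    matrix = []
    for doc in tokenized_docs:
        row = [0.0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[column[term]] = float(count)
        matrix.append(row)
    return matrix, vocabulary


# Toy documents (assumed for the example).
docs = [
    ["rap", "game", "yeah", "yeah"],
    ["shores", "moving", "heart", "heart", "heart"],
]
bow_matrix, bow_vocab = build_bow(docs)
print(bow_vocab)
print(bow_matrix)
```

Most entries are zero because each document uses only a small part of the vocabulary, which is consistent with the mostly-zero array printed above.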

dataset.get_tfidf()

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array(['00', '000', '001', ..., 'zurich', 'zverev', 'zzzz'], dtype=object))
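
The TF-IDF output has the same (matrix, vocabulary) shape as the bag-of-words output, with counts replaced by weights. The source does not show which TF-IDF variant the library uses, so the sketch below assumes one common choice: a smoothed inverse document frequency followed by L2 row normalization. The helper name build_tfidf and the toy documents are hypothetical.

```python
import math
from collections import Counter


def build_tfidf(tokenized_docs):
    """Return (matrix, vocabulary) with smoothed-IDF, L2-normalized weights.

    Hypothetical helper; the weighting scheme is an assumption, not
    necessarily what stream_topic computes.
    """
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}

    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]
        # Scale each row to unit length; leave all-zero rows untouched.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return matrix, vocabulary


# Toy documents (assumed for the example); "yeah" occurs in both.
docs = [["yeah", "rap", "game"], ["yeah", "heart", "heart"]]
tfidf_matrix, tfidf_vocab = build_tfidf(docs)
print(tfidf_vocab)
print(tfidf_matrix)
```

Under this scheme, a term that appears in every document ("yeah") receives a lower weight than a term that appears in only one ("rap"), even when both occur once in the same document.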

# dataset.get_word_embeddings()

dataset.fetch_dataset('Spotify')

2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:108 - Dataset name already provided while instantiating the class: Reuters
2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:111 - Overwriting the dataset name with the name provided in fetch_dataset: Spotify
2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:115 - Fetching dataset: Spotify
2024-08-09 15:32:42.490 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:32:43.475 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:32:43.813 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:32:43.977 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/

dataset.dataframe.head()

	name	duration_ms	explicit	artists	release_date	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	time_signature	text	labels	tokens
0	What They Want	165853	1	['Russ']	2017-05-05	0.710	0.404	1	-10.040	0	0.3790	0.48400	0.000000	0.0953	0.398	139.553	4	yeah ooh yeah they let the rap game yeah yeah ...	75	[yeah, ooh, yeah, they, let, the, rap, game, y...
1	Shores	281367	0	['Seinabo Sey', 'Vargas & Lagola']	2019-09-20	0.431	0.491	5	-6.615	1	0.0288	0.32200	0.000000	0.0679	0.275	143.879	4	seinabo sey have always wondered your cause wh...	58	[seinabo, sey, have, always, wondered, your, c...
2	The Prayer	255360	0	['Anthony Callea']	2005	0.217	0.460	10	-5.133	1	0.0302	0.76800	0.000008	0.0847	0.109	138.822	4	youll our eyes and watch where and when dont k...	37	[youll, our, eyes, and, watch, where, and, whe...
3	Send Me the Pillow You Dream On	147440	0	['Hank Locklin']	2003-03-03	0.595	0.308	3	-11.626	1	0.0333	0.84000	0.000004	0.0942	0.624	119.755	4	send the pillow that you dream dont you know t...	45	[send, the, pillow, that, you, dream, dont, yo...
4	It's a Rainy Day	255400	0	['Ice Mc']	2008-03-16	0.619	0.736	2	-11.686	0	0.0302	0.00482	0.001050	0.3350	0.484	134.955	4	alexia and ice you the came down you were life...	41	[alexia, and, ice, you, the, came, down, you, ...

dataset.texts[:2]

['yeah ooh yeah they let the rap game yeah yeah yeah yeah they let this rap game yeah yeah got chick call her she feel like the like and some and feel like she feel like but she aint the only one got chick call her she she and off now she just got the the her they but they aint the only what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh this now they call they yeah off probably the only one yeah when you you all the like got the and and some probably the only one yeah what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh who ill you who fuck just all the what all the fuck they like the but know what they want aint its and but pop pop the let the boss when boss ill what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh',
 'seinabo sey have always wondered your cause when you you can not and ive been you your just see one seinabo sey will you ever tell what really your mind cause the greatest love youll ever find seinabo sey moving life better shores you see shores you see sure moving somebody thats sure sure sure see moving moving have always wondered you your like every word the ive always been and every word but the seinabo sey vargas lagola will you ever tell what really your mind cause the greatest love youll ever find you know seinabo sey vargas lagola seinabo sey moving life better shores you see shores you see sure moving somebody thats sure sure sure see moving cause not seinabo sey vargas lagola seinabo sey and you cant hold heart heart heart heart and you cant hold heart heart heart heart you cant hold heart heart hold heart and you cant hold heart heart heart and you cant hold heart heart you cant you cant you cant you cant hold heart heart heart heart heart moving moving ohohooh ohohooh ohohooh']

dataset.tokens

dataset.labels[:2]

[75, 58]
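
Judging by the frame shown earlier, texts, tokens, and labels are parallel sequences with one entry per document, and each token list looks like the whitespace-split text. The sketch below illustrates that relationship on shortened copies of the first two Spotify rows. Both the truncation and the whitespace tokenization are assumptions for illustration; the library's actual tokenizer may differ.

```python
# Shortened prefixes of the first two texts shown above (truncated here on purpose).
texts = [
    "yeah ooh yeah they let the rap game",
    "seinabo sey have always wondered your cause",
]
labels = [75, 58]

# Assumption: tokens correspond to a plain whitespace split of each text.
tokens = [text.split() for text in texts]

# The three sequences stay aligned by position, so they can be zipped per document.
for label, doc_tokens in zip(labels, tokens):
    print(label, doc_tokens[:5])
```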

Loading your own dataset

import pandas as pd
import numpy as np


# Simulating some example data
np.random.seed(0)

# Generate 1000 random strings of lengths between 1 and 5, containing letters 'A' to 'Z'
random_documents = [''.join(np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 
                                             np.random.randint(1, 6))) for _ in range(1000)]

# Generate 1000 random labels from 1 to 4 as strings
random_labels = np.random.choice(['1', '2', '3', '4'], 1000)

# Create DataFrame
my_data = pd.DataFrame({"Documents": random_documents, "Labels": random_labels})

dataset = TMDataset()
dataset.create_load_save_dataset(
    data=my_data, 
    dataset_name="sample_data",
    save_dir="data/",
    doc_column="Documents",
    label_column="Labels"
    )

Preprocessing documents: 100%|██████████| 1000/1000 [00:03<00:00, 251.82it/s]
2024-08-09 15:32:48.092 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:237 - Dataset saved to data/sample_data.parquet
2024-08-09 15:32:48.093 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:252 - Dataset info saved to data/sample_data_info.pkl

# The new dataset is saved in the data/ folder rather than in the default dataset location,
# so pass that folder via dataset_path and set source="local" when fetching it.
dataset.fetch_dataset(name="sample_data", dataset_path="data/", source="local")

2024-08-09 15:32:48.097 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: sample_data
2024-08-09 15:32:48.098 | INFO     | stream_topic.utils.dataset:fetch_dataset:128 - Fetching dataset from local path

dataset.dataframe.head()

	text	labels	tokens
0	PVADD	2	[PVADD]
1	TV	4	[TV]
2	EXG	4	[EXG]
3	Y	4	[Y]
4	BGHXO	3	[BGHXO]