Quickstart

Note: Make sure the required NLTK resources are installed. If they are not, run the following code:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
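The downloads above can be skipped on later runs if the resources are already on disk. The following is a minimal sketch of that check, not part of stream_topic: the helper name `ensure_nltk_resources` and the `NLTK_RESOURCES` mapping are hypothetical, and the resource paths assume NLTK's usual data layout (`corpora/`, `tokenizers/`, `taggers/`). It relies on `nltk.data.find` raising `LookupError` for a missing resource.

```python
# Hypothetical helper: download an NLTK resource only when it is missing.
# The resource paths below are assumptions based on NLTK's usual data layout.
NLTK_RESOURCES = {
    "corpora/stopwords": "stopwords",
    "tokenizers/punkt": "punkt",
    "corpora/wordnet": "wordnet",
    "taggers/averaged_perceptron_tagger": "averaged_perceptron_tagger",
}


def ensure_nltk_resources(resources, find=None, download=None):
    """Return the names of the resources that had to be downloaded."""
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tried with stubs

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, name in resources.items():
        try:
            find(path)  # raises LookupError if the resource is not installed
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched
```

Calling `ensure_nltk_resources(NLTK_RESOURCES)` once at the top of the notebook would then replace the four unconditional `nltk.download` calls.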

# Uncomment the line below if running in Colab.
# The package needs to be installed for the notebook to run.

# ! pip install -U stream_topic

import warnings
warnings.filterwarnings("ignore")

from stream_topic.models import CEDC
from stream_topic.utils import TMDataset
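The blanket `warnings.filterwarnings("ignore")` call above silences every warning for the rest of the session, which keeps the notebook output short but can also hide useful messages. As an alternative, here is a minimal sketch that suppresses warnings only around a single call, using the standard library's `warnings.catch_warnings`. The helper name `run_quietly` is hypothetical and not part of stream_topic.

```python
import warnings


def run_quietly(func, *args, category=Warning, **kwargs):
    """Call func with warnings of the given category ignored for this call only."""
    with warnings.catch_warnings():
        # The filter is active only inside this block; the previous
        # warning filters are restored when the block exits.
        warnings.simplefilter("ignore", category=category)
        return func(*args, **kwargs)
```

Under this approach, a noisy step could be wrapped as `run_quietly(some_function, some_argument)` while warnings elsewhere in the notebook stay visible.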

CEDC model

dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="CEDC")

2024-08-09 15:35:15.170 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: BBC_News
2024-08-09 15:35:15.244 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:35:15.518 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:35:15.663 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:35:15.795 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/
Preprocessing documents: 100%|██████████| 2225/2225 [00:11<00:00, 198.52it/s]

model = CEDC()
output = model.fit(dataset, n_topics=10)

2024-08-09 15:35:27.056 | INFO     | stream_topic.models.CEDC:fit:241 - --- Training CEDC topic model ---
2024-08-09 15:35:27.122 | INFO     | stream_topic.models.abstract_helper_models.base:prepare_embeddings:215 - --- Loading precomputed paraphrase-MiniLM-L3-v2 embeddings ---
2024-08-09 15:35:27.191 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:302 - Downloading embeddings from github
2024-08-09 15:35:27.416 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:304 - Embeddings  downloaded successfully at ~/stream_topic_data/
2024-08-09 15:35:27.423 | INFO     | stream_topic.models.abstract_helper_models.base:dim_reduction:196 - --- Reducing dimensions ---
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2024-08-09 15:35:32.238 | INFO     | stream_topic.models.CEDC:_clustering:175 - --- Creating document cluster ---
2024-08-09 15:35:37.431 | INFO     | stream_topic.models.CEDC:fit:259 - --- Extract topics ---
2024-08-09 15:35:41.513 | INFO     | stream_topic.models.CEDC:fit:284 - --- Training completed successfully. ---

from stream_topic.visuals import visualize_topic_model

visualize_topic_model(
    model,
    reduce_first=True,
    port=8052,
)

CTMNeg model

from stream_topic.models import CTMNeg
dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="CTMNeg")
model = CTMNeg(encoder_dim=64, dropout=0.3)
output = model.fit(dataset, n_topics=5, max_epochs=2)

2024-08-09 15:35:45.415 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: BBC_News
2024-08-09 15:35:45.492 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:35:45.691 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:35:45.786 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:35:45.926 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/
Preprocessing documents: 100%|██████████| 2225/2225 [00:10<00:00, 213.03it/s]
2024-08-09 15:35:56.466 | INFO     | stream_topic.models.abstract_helper_models.base:prepare_embeddings:215 - --- Loading precomputed paraphrase-MiniLM-L3-v2 embeddings ---
2024-08-09 15:35:56.539 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:302 - Downloading embeddings from github
2024-08-09 15:35:56.851 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:304 - Embeddings  downloaded successfully at ~/stream_topic_data/
2024-08-09 15:35:56.860 | INFO     | stream_topic.models.ctmneg:_initialize_datamodule:314 - --- Initializing Datamodule for CTMNeg ---
2024-08-09 15:35:57.069 | INFO     | stream_topic.models.ctmneg:_initialize_trainer:273 - --- Initializing Trainer for CTMNeg ---
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
2024-08-09 15:35:57.094 | INFO     | stream_topic.models.ctmneg:fit:457 - --- Training CTMNeg topic model ---

  | Name                    | Type              | Params | Mode 
----------------------------------------------------------------------
0 | model                   | CTMNegBase        | 6.9 M  | train
1 | model.inference_network | InferenceNetwork  | 6.8 M  | train
2 | model.mean_bn           | BatchNorm1d       | 10     | train
3 | model.logvar_bn         | BatchNorm1d       | 10     | train
4 | model.beta_batchnorm    | BatchNorm1d       | 26.6 K | train
5 | model.theta_drop        | Dropout           | 0      | train
6 | model.triplet_loss      | TripletMarginLoss | 0      | train
----------------------------------------------------------------------
6.9 M     Trainable params
13.3 K    Non-trainable params
6.9 M     Total params
27.615    Total estimated model params size (MB)

2024-08-09 15:35:59.005 | INFO     | stream_topic.models.ctmneg:fit:473 - --- Training completed successfully. ---