Quickstart
Note: Make sure the required NLTK resources are installed. If they are not, run the following commands:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
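When this setup cell is re-run often, the four downloads can be wrapped in a small helper that reports which resources failed. The following is a minimal sketch: `fetch_resources`, `fetch_nltk_resources`, and the `download` parameter are hypothetical names introduced here, and the only NLTK behavior relied on is that `nltk.download` accepts `quiet=True` and returns a boolean.

```python
NLTK_RESOURCES = ["stopwords", "punkt", "wordnet", "averaged_perceptron_tagger"]


def fetch_resources(names, download):
    """Call download(name) for each name and return the names that failed."""
    failed = []
    for name in names:
        if not download(name):
            failed.append(name)
    return failed


def fetch_nltk_resources():
    """Download the NLTK resources listed above without progress output."""
    import nltk  # imported lazily so fetch_resources stays dependency-free

    return fetch_resources(
        NLTK_RESOURCES, lambda name: nltk.download(name, quiet=True)
    )
```

Calling `fetch_nltk_resources()` returns an empty list when every download succeeded, and the names of the failed resources otherwise.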
# Uncomment the line below if running in Colab;
# the package needs to be installed for the notebook to run.
# ! pip install -U stream_topic
import warnings
warnings.filterwarnings("ignore")
from stream_topic.models import CEDC
from stream_topic.utils import TMDataset
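The call `warnings.filterwarnings("ignore")` silences every warning for the rest of the session, which keeps the notebook output short but can also hide useful deprecation notices. As an alternative sketch using only the standard library, warnings can be suppressed for a single call instead; `noisy_step` is a hypothetical stand-in for any function that emits warnings, and `run_quietly` is an illustrative helper name.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "done"


def run_quietly(func):
    """Run func with warnings suppressed, restoring the filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func()


result = run_quietly(noisy_step)
```

Outside `run_quietly`, the previous warning filters apply again, so warnings raised by other code remain visible.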
CEDC model
dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="CEDC")
2024-08-09 15:35:15.170 | INFO | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: BBC_News
2024-08-09 15:35:15.244 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:35:15.518 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:35:15.663 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:35:15.795 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/
Preprocessing documents: 100%|██████████| 2225/2225 [00:11<00:00, 198.52it/s]
model = CEDC()
output = model.fit(dataset, n_topics=10)
2024-08-09 15:35:27.056 | INFO | stream_topic.models.CEDC:fit:241 - --- Training CEDC topic model ---
2024-08-09 15:35:27.122 | INFO | stream_topic.models.abstract_helper_models.base:prepare_embeddings:215 - --- Loading precomputed paraphrase-MiniLM-L3-v2 embeddings ---
2024-08-09 15:35:27.191 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:302 - Downloading embeddings from github
2024-08-09 15:35:27.416 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:304 - Embeddings downloaded successfully at ~/stream_topic_data/
2024-08-09 15:35:27.423 | INFO | stream_topic.models.abstract_helper_models.base:dim_reduction:196 - --- Reducing dimensions ---
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2024-08-09 15:35:32.238 | INFO | stream_topic.models.CEDC:_clustering:175 - --- Creating document cluster ---
2024-08-09 15:35:37.431 | INFO | stream_topic.models.CEDC:fit:259 - --- Extract topics ---
2024-08-09 15:35:41.513 | INFO | stream_topic.models.CEDC:fit:284 - --- Training completed successfully. ---
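The timestamps in the log lines above give a rough picture of how long fitting takes. As a small worked example, the sketch below parses the first and last training messages and computes the time elapsed between them. `parse_timestamp` and `elapsed_seconds` are illustrative helper names, not part of stream_topic, and the parsing assumes the `timestamp | LEVEL | location - message` layout shown in the output.

```python
from datetime import datetime

LOG_START = (
    "2024-08-09 15:35:27.056 | INFO | stream_topic.models.CEDC:fit:241"
    " - --- Training CEDC topic model ---"
)
LOG_END = (
    "2024-08-09 15:35:41.513 | INFO | stream_topic.models.CEDC:fit:284"
    " - --- Training completed successfully. ---"
)


def parse_timestamp(line):
    """Return the leading timestamp of a log line as a datetime."""
    stamp = line.split(" | ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def elapsed_seconds(start_line, end_line):
    """Seconds elapsed between the timestamps of two log lines."""
    delta = parse_timestamp(end_line) - parse_timestamp(start_line)
    return delta.total_seconds()


print(elapsed_seconds(LOG_START, LOG_END))  # 14.457
```

For this run, fitting CEDC on the precomputed embeddings took about 14.5 seconds from the first training message to the last.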
from stream_topic.visuals import visualize_topic_model
visualize_topic_model(
    model,
    reduce_first=True,
    port=8052,
)
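For a quick text summary alongside the visualization, the top words of each topic can also be printed. The sketch below is hedged: it assumes the fitted model exposes its topics as a list of word lists (possibly through a `get_topics()` method, which should be checked against the stream_topic API), and it uses a hand-written placeholder list rather than real model output. `format_topics` is a hypothetical helper.

```python
def format_topics(topics, top_n=5):
    """Return one line per topic, listing its first top_n words."""
    lines = []
    for index, words in enumerate(topics):
        lines.append(f"Topic {index}: {', '.join(words[:top_n])}")
    return lines


# Placeholder data for illustration only; not output from the model above.
# With a fitted model this might be replaced by: topics = model.get_topics()
# (assumed method name; verify against the library documentation).
example_topics = [
    ["film", "award", "actor", "music", "show", "band"],
    ["market", "growth", "bank", "economy", "shares", "profit"],
]

for line in format_topics(example_topics, top_n=3):
    print(line)
```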
CTMNeg model
from stream_topic.models import CTMNeg
dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="CTMNeg")
model = CTMNeg(encoder_dim=64, dropout=0.3)
output = model.fit(dataset, n_topics=5, max_epochs=2)
2024-08-09 15:35:45.415 | INFO | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: BBC_News
2024-08-09 15:35:45.492 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:35:45.691 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:35:45.786 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:35:45.926 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/
Preprocessing documents: 100%|██████████| 2225/2225 [00:10<00:00, 213.03it/s]
2024-08-09 15:35:56.466 | INFO | stream_topic.models.abstract_helper_models.base:prepare_embeddings:215 - --- Loading precomputed paraphrase-MiniLM-L3-v2 embeddings ---
2024-08-09 15:35:56.539 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:302 - Downloading embeddings from github
2024-08-09 15:35:56.851 | INFO | stream_topic.utils.data_downloader:load_custom_dataset_from_url:304 - Embeddings downloaded successfully at ~/stream_topic_data/
2024-08-09 15:35:56.860 | INFO | stream_topic.models.ctmneg:_initialize_datamodule:314 - --- Initializing Datamodule for CTMNeg ---
2024-08-09 15:35:57.069 | INFO | stream_topic.models.ctmneg:_initialize_trainer:273 - --- Initializing Trainer for CTMNeg ---
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
2024-08-09 15:35:57.094 | INFO | stream_topic.models.ctmneg:fit:457 - --- Training CTMNeg topic model ---
  | Name                    | Type              | Params | Mode
----------------------------------------------------------------------
0 | model                   | CTMNegBase        | 6.9 M  | train
1 | model.inference_network | InferenceNetwork  | 6.8 M  | train
2 | model.mean_bn           | BatchNorm1d       | 10     | train
3 | model.logvar_bn         | BatchNorm1d       | 10     | train
4 | model.beta_batchnorm    | BatchNorm1d       | 26.6 K | train
5 | model.theta_drop        | Dropout           | 0      | train
6 | model.triplet_loss      | TripletMarginLoss | 0      | train
----------------------------------------------------------------------
6.9 M     Trainable params
13.3 K    Non-trainable params
6.9 M     Total params
27.615    Total estimated model params size (MB)
2024-08-09 15:35:59.005 | INFO | stream_topic.models.ctmneg:fit:473 - --- Training completed successfully. ---
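The "Total estimated model params size (MB)" figure in the summary above can be sanity-checked with simple arithmetic. The sketch assumes 32-bit (4-byte) parameters and megabytes of 10^6 bytes. This is consistent with the printed figures, but it is an assumption about how the summary is computed. `estimated_size_mb` and `params_from_size_mb` are illustrative helpers.

```python
BYTES_PER_PARAM = 4  # assumption: float32 parameters


def estimated_size_mb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Estimated parameter memory in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6


def params_from_size_mb(size_mb, bytes_per_param=BYTES_PER_PARAM):
    """Invert the estimate: parameter count implied by a size in MB."""
    return round(size_mb * 1e6 / bytes_per_param)


# 27.615 MB is the size reported in the model summary.
implied_params = params_from_size_mb(27.615)
print(implied_params)  # 6903750, i.e. about 6.9 M
```

The implied count of roughly 6.9 million parameters matches the "Total params" line of the summary.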