# Musora Sentiment Analysis Dashboard

A Streamlit dashboard for visualising sentiment analysis results from **social media comments** (Facebook, Instagram, YouTube, Twitter), the **Musora internal app**, **HelpScout customer support conversations**, and **Learning Paths** lesson engagement data across brands (Drumeo, Pianote, Guitareo, Singeo, Playbass).

---

## Table of Contents

1. [Project Structure](#project-structure)
2. [How Data Flows](#how-data-flows)
3. [Data Loading Strategy](#data-loading-strategy)
4. [Pages](#pages)
5. [Global Filters & Session State](#global-filters--session-state)
6. [Snowflake Queries](#snowflake-queries)
7. [Authentication](#authentication)
8. [PDF Reports](#pdf-reports)
9. [AI Agents](#ai-agents)
10. [Adding or Changing Things](#adding-or-changing-things)
11. [Running the App](#running-the-app)
12. [Configuration Reference](#configuration-reference)

---

## Project Structure

```
visualization/
├── app.py                              # Entry point — routing, sidebar, session state
├── config/
│   └── viz_config.json                 # Colors, query strings, dashboard settings
├── data/
│   ├── data_loader.py                  # Comment Snowflake queries and caching
│   └── helpscout_data_loader.py        # HelpScout Snowflake queries and caching
├── utils/
│   ├── auth.py                         # Login page, authentication helpers
│   ├── data_processor.py               # Pandas aggregations (intent dist, content summary, etc.)
│   ├── metrics.py                      # KPI calculations (sentiment score, urgency, etc.)
│   ├── pdf_exporter.py                 # DashboardPDFExporter (comment dashboard PDF)
│   ├── helpscout_utils.py              # Pure helpers: parse_topics, explode_topics, boolean_flag_counts
│   └── helpscout_pdf.py                # HelpScoutDashboardPDF + HelpScoutAnalysisPDF
├── components/
│   ├── dashboard.py                    # Comment Dashboard page renderer
│   ├── sentiment_analysis.py           # Sentiment Analysis page renderer
│   ├── reply_required.py               # Reply Required page renderer
│   ├── helpscout_dashboard.py          # HelpScout Dashboard page + compact summary widget
│   └── helpscout_analysis.py           # HelpScout Analysis page (filter→fetch→charts→LLM→PDF)
├── visualizations/
│   ├── sentiment_charts.py             # Plotly sentiment chart functions
│   ├── distribution_charts.py          # Plotly distribution / heatmap / scatter functions
│   ├── demographic_charts.py           # Plotly demographic chart functions
│   ├── content_cards.py                # Streamlit card components (comment + content cards)
│   └── helpscout_charts.py             # HelpScoutCharts Plotly factory (16 chart types)
├── agents/
│   ├── base_agent.py                   # BaseVisualizationAgent (shared interface)
│   ├── content_summary_agent.py        # AI analysis for comment content summarisation
│   └── helpscout_summary_agent.py      # HelpScoutSummaryAgent — page-level LLM summary from SUMMARY fields
├── img/
│   └── musora.png                      # Sidebar logo
└── SnowFlakeConnection.py              # Snowflake connection wrapper (Snowpark session)
```

---

## How Data Flows

```
Snowflake
    │
    ├── data_loader.py (SentimentDataLoader)
    │       ├── load_dashboard_data()        ──► st.session_state['dashboard_df']
    │       │                                       └─► sidebar (filter options, counts)
    │       │                                       └─► dashboard.py (all charts)
    │       ├── load_sa_data()               ──► st.session_state['sa_contents', 'sa_comments']
    │       │   (on-demand, Fetch button)           └─► sentiment_analysis.py
    │       └── load_reply_required_data()   ──► st.session_state['rr_df']
    │           (on-demand, Fetch button)           └─► reply_required.py
    │
    └── helpscout_data_loader.py (HelpScoutDataLoader)
            ├── load_dashboard_data()        ──► st.session_state['helpscout_df']
            │                                       └─► helpscout_dashboard.py
            │                                       └─► dashboard.py (compact summary)
            └── load_analysis_data()         ──► st.session_state['hs_analysis_df']
                (on-demand, Fetch button)           └─► helpscout_analysis.py
```

**Key principle:** Data is loaded as little as possible, as late as possible.

- **Dashboard** queries are lightweight (no text columns, no content join) and cached 24 hours.
- **Sentiment Analysis**, **Reply Required**, and **HelpScout Analysis** pages wait for the user to click **Fetch Data**.
- All data lives in `st.session_state` so page navigation and widget interactions never re-trigger Snowflake queries.

---

## Data Loading Strategy

### Comment data (`data/data_loader.py` — `SentimentDataLoader`)

#### `load_dashboard_data()`
- Fetches only: `comment_sk, content_sk, platform, brand, sentiment_polarity, intent, requires_reply, detected_language, comment_timestamp, processed_at, author_id`.
- No text columns, no `DIM_CONTENT` join.
- Merges demographics data if `demographics_query` is configured.
- Cached **24 hours**. Called once at startup; stored in `st.session_state['dashboard_df']`.

#### `load_sa_data(platform, brand, top_n, min_comments, sort_by, sentiments, intents, emotions, date_range)`
- Runs two Snowflake queries:
  1. **Content aggregation** — groups by `content_sk`, counts per sentiment, computes severity score, returns top N.
  2. **Sampled comments** — up to 50 per sentiment group per content (`QUALIFY ROW_NUMBER() <= 50`). `display_text` computed in SQL.
- Returns `(contents_df, comments_df)`. Cached **24 hours**.

#### `load_reply_required_data(platforms, brands, date_range)`
- Filters `REQUIRES_REPLY = TRUE`. Conditionally includes the social media table and/or musora table. Cached **24 hours**.

#### SQL column qualification note
The social media table and `DIM_CONTENT` share column names. Any `WHERE` clause inside a query that joins them **must** use the table alias prefix (e.g. `s.PLATFORM`, `s.COMMENT_TIMESTAMP`) to avoid Snowflake `ambiguous column name` errors.

---

### HelpScout data (`data/helpscout_data_loader.py` — `HelpScoutDataLoader`)

#### `load_dashboard_data()`
- Lightweight query from `SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES`.
- Columns: `conversation_id, status, source, created_at, updated_at, duration_hours, sentiment_polarity, topics, is_refund_request, is_cancellation, is_membership, customer_email`.
- Merges demographics (age/timezone/experience) via email join (`LOWER(customer_email) = LOWER(usora_users.email)`).
- After the demographics merge, adds **`is_member`** boolean: `True` when the customer email matched a Musora user record, `False` otherwise.
- Cached **24 hours**. Stored in `st.session_state['helpscout_df']`.

#### `load_analysis_data(date_start, date_end, topics, sentiments, statuses, sources, is_refund, is_cancellation, is_membership)`
- Adds `summary, sentiment_notes, topic_notes, customer_first_name, customer_last_name` columns.
- SQL `WHERE` pushdown for all filters; multi-label topic filter uses `ARRAY_CONTAINS('topic_id'::VARIANT, SPLIT(TOPICS, ','))`.
- Cached **24 hours** keyed on filter tuple. Stored in `st.session_state['hs_analysis_df']`.

#### `get_filter_options(df)`
- Returns `sentiments`, `topics` (exploded and label-mapped from taxonomy), `statuses`, `states`, `sources`.

---

## Pages

The app has **5 pages** navigated via the sidebar radio:

### 1. Sentiment Dashboard (`components/dashboard.py`)

**Receives:** `filtered_df` — lightweight comment dataframe (after optional global filter from `app.py`).

**Key sections:**
- Summary stats + health indicator
- Sentiment distribution (pie + gauge)
- Sentiment by brand and platform (stacked + percentage bar charts)
- Intent analysis (bar + pie)
- Emotion analysis (bar + pie) — only when `emotions` column is non-null
- Brand–Platform heatmap
- Reply requirements + urgency breakdown
- Demographics (age, timezone, experience) — only when demographics were merged
- **HelpScout compact summary** — appended at bottom; reads `st.session_state['helpscout_df']` directly (guarded by `try/except` so failures never break the main dashboard)

---

### 2. Custom Sentiment Queries (`components/sentiment_analysis.py`)

**Receives:** `data_loader` instance only.

**Flow:**
1. Reads `st.session_state['dashboard_df']` for filter option lists.
2. Pre-populates platform/brand dropdowns from `st.session_state['global_filters']`.
3. On **Fetch Data**: calls `data_loader.load_sa_data(...)`, stores results in `st.session_state['sa_contents']` and `['sa_comments']`.
4. Renders content cards, per-content sentiment + intent + emotion charts, AI analysis buttons, sampled comment expanders.

**Pagination:** `st.session_state['sentiment_page']` (5 contents per page). Reset on new fetch.

---

### 3. Reply Required (`components/reply_required.py`)

**Receives:** `data_loader` instance only.

**Flow:**
1. Pre-populates platform/brand/date from `st.session_state['global_filters']`.
2. On **Fetch Data**: calls `data_loader.load_reply_required_data(...)`, stores result in `st.session_state['rr_df']`.
3. Shows urgency breakdown, in-page filters (applied in Python, no extra query), paginated comment cards, and "Reply by Content" summary.

**Pagination:** `st.session_state['reply_page']` (10 comments per page). Reset on new fetch.

---

### 4. HelpScout Dashboard (`components/helpscout_dashboard.py`)

**Receives:** `helpscout_loader` instance.

**Reads from:** `st.session_state['helpscout_df']` (loaded at app startup).

**Key sections:**
- **Member status filter** (radio at top): "All Customers / Members Only / Non-Members Only" — filters the entire dashboard view before any section renders
- PDF export button (HelpScout Dashboard PDF)
- 6 KPI metrics: total conversations, average duration, refund requests, cancellations, negative rate, membership joins
- Sentiment distribution (pie + bar)
- Topic distribution and sentiment heatmap (from `process_helpscout/config_files/topics.json` taxonomy)
- Boolean flags (refund, cancellation, membership) breakdown
- Status and source breakdown
- Timelines expander (daily conversation volume, refund/cancel trend)
- Depth expander (topic co-occurrence, escalation funnel)
- **Member vs Non-Member section**: KPI metrics (member count, non-member count, email match rate) + member share pie chart + sentiment by member status stacked bar + top topics by member status grouped bar
- Demographics (age, timezone, experience)

> **Note:** Global sidebar filters (brand, platform, sentiment, date) do **not** apply to HelpScout pages — HelpScout is brand-agnostic and uses its own filter panel.

---

### 5. HelpScout Analysis (`components/helpscout_analysis.py`)

**Receives:** `helpscout_loader` instance.

**Flow:**
1. **Filter panel** — date range, top_n, topics (multi-select with human-readable labels), sentiments, statuses, sources, 3 boolean checkboxes (refund / cancellation / membership), and a **"Customer Type" selectbox** (All / Members Only / Non-Members Only).
2. **Fetch Data** button — calls `helpscout_loader.load_analysis_data(...)`, stale-checked via `fetch_key` tuple. The Customer Type filter is **not** part of the Snowflake query — it is applied in Python after fetching, using the member email set derived from `st.session_state['helpscout_df']`.
3. **KPI row** + distribution charts (sentiment, topics, flags, status).
4. **Member vs Non-Member section** — always rendered when member data is available; shows share pie, sentiment stacked bar, and top-topics grouped bar split by member status.
5. **AI Summary section:**
   - "Generate AI Summary" button → calls `HelpScoutSummaryAgent`, stores result in `st.session_state['hs_analysis_summary']`.
   - Renders: executive summary, top themes, top complaints, unexpected insights, notable quotes.
   - "Export Analysis PDF" button → generates `HelpScoutAnalysisPDF`.
6. **Paginated conversation cards** — 10 per page; each card shows customer name, status, topics (label-mapped), summary, sentiment/topic notes.
7. **CSV export** button.

**Pagination:** `st.session_state['hs_analysis_page']`. Reset on new fetch.

**Date range default:** Clamps to `max(min_date, max_date − default_date_range_days)` so the default is always within the available data window.

---

## Global Filters & Session State

Global filters apply **only to comment pages** (Dashboard, Sentiment Analysis, Reply Required). They have no effect on HelpScout pages.

```python
st.session_state['global_filters'] = {
    'platforms':  ['facebook', 'instagram'],
    'brands':     ['drumeo'],
    'sentiments': [],
    'date_range': (date(2025, 1, 1), date(2025, 12, 31)),  # or None
}
```

### Full session state key reference

| Key | Set by | Used by |
|-----|--------|---------|
| `dashboard_df` | `app.py` startup | sidebar, dashboard.py, SA + RR filter lists |
| `global_filters` | sidebar "Apply Filters" | app.py (dashboard filter), SA + RR pre-populate |
| `filters_applied` | sidebar buttons | app.py |
| `sa_contents` | SA fetch button | sentiment_analysis.py |
| `sa_comments` | SA fetch button | sentiment_analysis.py |
| `sa_fetch_key` | SA fetch button | SA stale-check |
| `rr_df` | RR fetch button | reply_required.py |
| `rr_fetch_key` | RR fetch button | RR stale-check |
| `sentiment_page` | SA page / fetch | SA pagination |
| `reply_page` | RR page / fetch | RR pagination |
| `content_summaries` | SA AI buttons | SA AI analysis display |
| `helpscout_df` | `app.py` startup | helpscout_dashboard.py (includes `is_member`), dashboard.py compact summary, helpscout_analysis.py member filter |
| `hs_analysis_df` | HS Analysis fetch | helpscout_analysis.py charts + cards |
| `hs_analysis_fetch_key` | HS Analysis fetch | HS Analysis stale-check |
| `hs_analysis_filter_desc` | HS Analysis fetch | human-readable filter string for PDF + agent |
| `hs_analysis_summary` | "Generate AI Summary" | HS Analysis summary renderer |
| `hs_analysis_summary_key` | "Generate AI Summary" | invalidated on re-fetch |
| `hs_analysis_page` | HS Analysis page / fetch | HS Analysis pagination |

---

## Snowflake Queries

### Comment tables

| Table | Platform | Notes |
|-------|----------|-------|
| `SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES` | facebook, instagram, youtube, twitter | Needs `LEFT JOIN DIM_CONTENT` for `PERMALINK_URL` |
| `SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES` | musora_app | Has `PERMALINK_URL` and `THUMBNAIL_URL` natively |

### HelpScout table

| Table | Notes |
|-------|-------|
| `SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES` | One row per conversation; multi-label topics in comma-separated `TOPICS` column |

### Static queries (in `viz_config.json`)

| Key | Purpose |
|-----|---------|
| `dashboard_query` | Lightweight comment query — no text, no DIM_CONTENT join |
| `demographics_query` | Joins `usora_users` + `preprocessed.users` for age/timezone/experience |
| `helpscout.dashboard_query` | Lightweight HelpScout query (no SUMMARY/notes) |
| `helpscout.demographics_query` | Same demographics join, keyed on `customer_email` |

### Dynamic queries (built in `helpscout_data_loader.py`)

| Method | Description |
|--------|-------------|
| `_build_analysis_query()` | Full HelpScout query including SUMMARY/notes; multi-label topic filter via `ARRAY_CONTAINS` |

---

## Authentication

Module: `utils/auth.py`

- `AUTHORIZED_EMAILS` allowlist + `APP_TOKEN` env var.
- `render_login_page()` renders the login form and calls `st.stop()` when not authenticated.
- Gate is placed at the top of `app.py` (after `st.set_page_config`, before data loaders).
- Current user and logout button are shown in the sidebar.

**Required env vars:**
```
APP_TOKEN=<shared token>
```

---

## PDF Reports

### Comment Dashboard PDF (`utils/pdf_exporter.py` — `DashboardPDFExporter`)

Generated from the "Export PDF Report" expander at the top of the Dashboard page.

Sections: cover, executive summary, sentiment, brand, platform, intent, cross-dimensional, volume, reply requirements, demographics (optional), language (optional), HelpScout summary (if data loaded), data summary.

### HelpScout Dashboard PDF (`utils/helpscout_pdf.py` — `HelpScoutDashboardPDF`)

Generated from the HelpScout Dashboard page. Sections: cover, KPI summary, sentiment, topics, flags & escalation, status & source, timelines, depth, **member vs non-member** (metrics + pie + sentiment bar + topic grouped bar), demographics.

### HelpScout Analysis PDF (`utils/helpscout_pdf.py` — `HelpScoutAnalysisPDF`)

Generated from the "Export Analysis PDF" button on the HelpScout Analysis page (only available after an AI Summary has been generated).

Sections: cover, filter summary, KPI summary (including member/non-member counts when available), chart snapshots, **member vs non-member breakdown** (pie + sentiment bar + topic grouped bar), AI summary (executive summary, top themes, top complaints, unexpected insights, notable quotes), conversation cards sample, metadata.

**Dependencies:** `fpdf2`, `kaleido` (for Plotly PNG rendering at 3× scale).

---

## AI Agents

### `ContentSummaryAgent` (`agents/content_summary_agent.py`)

Summarises sampled comments for a single content item on the Sentiment Analysis page. Called per-content when the user clicks the AI analysis button. Results cached in `st.session_state['content_summaries']`.

### `HelpScoutSummaryAgent` (`agents/helpscout_summary_agent.py`)

Produces a **page-level** executive report from the filtered HelpScout conversations by reading their pre-extracted `SUMMARY` fields through an LLM.

- Stratified sample by `sentiment_polarity` — capped at `max_conversations` (default 300).
- Builds aggregate context: sentiment breakdown, top topics, flag counts, average duration, then per-conversation summaries (capped at 250 chars each).
- Prompt asks the LLM to surface patterns **beyond** the pre-tagged topics/sentiments.
- Output structure:

```json
{
    "executive_summary": "...",
    "top_themes": [{"theme": "...", "description": "...", "prevalence": "..."}],
    "top_complaints": ["..."],
    "unexpected_insights": ["..."],
    "notable_quotes": ["..."]
}
```

- Uses `LLMHelper.get_structured_completion()` with up to 3 retries.

---

## Adding or Changing Things

### Add a new chart to the Comment Dashboard
1. Write the chart function in the appropriate `visualizations/` file.
2. Call it from `render_dashboard()` in `components/dashboard.py`.

### Add a new chart to the HelpScout Dashboard
1. Add the chart method to `HelpScoutCharts` in `visualizations/helpscout_charts.py`.
2. Call it from `render_helpscout_dashboard()` in `components/helpscout_dashboard.py`.

### Add a new HelpScout filter
1. Add the widget to the filter panel in `helpscout_analysis.py`.
2. Include the new value in the `fetch_key` tuple.
3. Add the corresponding `WHERE` clause condition to `_build_analysis_query()` in `helpscout_data_loader.py`.

> **Python-side filters** (those whose data is not in the Snowflake HelpScout table) are applied after fetching rather than in SQL. The member/non-member filter is the canonical example: `is_member` is derived from `st.session_state['helpscout_df']` after the Snowflake fetch. Such filters should **not** be included in the `fetch_key` tuple.

### Add a new HelpScout topic
- Edit `process_helpscout/config_files/topics.json` (the taxonomy file).
- `helpscout_utils.load_topic_taxonomy()` reloads it on each app start; no other changes needed.

### Change the cache duration
`@st.cache_data(ttl=86400)` appears on `load_dashboard_data`, `_fetch_sa_data`, `_fetch_rr_data`, `load_demographics_data`, and their HelpScout equivalents. Change `86400` to the desired TTL. Users can always force a refresh with "Reload Data" in the sidebar.

### Add a new page
1. Create `components/new_page.py` with a `render_new_page(...)` function.
2. Import and add a radio option in `app.py`.
3. Add data loading to the appropriate loader class.
4. If the page should be excluded from global comment filters, extend the `_hs_page` guard in `app.py`.

### Change what the Sentiment Analysis page queries
- Edit `_build_sa_content_query()` and/or `_build_sa_comments_query()` in `data_loader.py`.
- Update `_process_sa_content_stats()` and/or `_process_sa_comments()` for new columns.

---

## Running the App

```bash
# From the project root
streamlit run visualization/app.py
```

**Required environment variables** (in `.env` at project root):

```
SNOWFLAKE_USER
SNOWFLAKE_PASSWORD
SNOWFLAKE_ACCOUNT
SNOWFLAKE_ROLE
SNOWFLAKE_DATABASE
SNOWFLAKE_WAREHOUSE
SNOWFLAKE_SCHEMA
OPENAI_API_KEY
APP_TOKEN
```

---

## Configuration Reference

`config/viz_config.json` controls:

| Section | What it configures |
|---------|-------------------|
| `color_schemes.sentiment_polarity` | Hex colors for each sentiment level |
| `color_schemes.intent` | Hex colors per intent label |
| `color_schemes.emotion` | Hex colors per emotion label |
| `color_schemes.platform` | Hex colors per platform |
| `color_schemes.brand` | Hex colors per brand |
| `color_schemes_helpscout.topics` | Hex colors for HelpScout topic bars |
| `color_schemes_helpscout.status` | Hex colors for conversation status values |
| `color_schemes_helpscout.boolean_flags` | Hex colors for refund/cancellation/membership flags |
| `sentiment_order` | Display order for sentiment categories |
| `intent_order` | Display order for intent categories |
| `emotion_order` | Display order for emotion categories |
| `negative_sentiments` | Which sentiment values count as "negative" |
| `dashboard.default_date_range_days` | Default date filter window for comment pages |
| `helpscout.default_date_range_days` | Default date filter window for HelpScout Analysis |
| `helpscout.max_summary_conversations` | Cap on conversations sent to LLM summary agent |
| `helpscout.escalation_sentiments` | Sentiment values that count as escalation |
| `snowflake.dashboard_query` | Lightweight comment dashboard query |
| `snowflake.demographics_query` | Demographics join query (comment pages) |
| `helpscout.dashboard_query` | Lightweight HelpScout dashboard query |
| `helpscout.demographics_query` | Demographics join query (HelpScout, keyed on email) |
| `demographics.age_groups` | Age bucket definitions (label → [min, max]) |
| `demographics.experience_groups` | Experience bucket definitions |
| `demographics.top_timezones_count` | How many timezones to show in the geographic chart |