| library_name: transformers | |
| tags: | |
| - cybersecurity | |
| - mpnet | |
| - classification | |
| - fine-tuned | |
| language: | |
| - en | |
| base_model: | |
| - sentence-transformers/all-mpnet-base-v2 | |
| # AttackGroup-MPNET - Model Card for MPNet Cybersecurity Classifier | |
| This is a fine-tuned MPNet model specialized for classifying cybersecurity threat groups based on textual descriptions of their tactics and techniques. | |
| ## Model Details | |
| ### Model Description | |
| This model is a fine-tuned MPNet classifier specialized in categorizing cybersecurity threat groups based on textual descriptions of their tactics, techniques, and procedures (TTPs). | |
| - **Developed by:** Dženan Hamzić | |
| - **Model type:** Transformer-based classification model (MPNet) | |
| - **Language(s) (NLP):** English | |
| - **License:** Apache-2.0 | |
| - **Finetuned from model:** microsoft/mpnet-base (with intermediate MLM fine-tuning) | |
| ### Model Sources | |
| - **Base Model:** [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) | |
| ## Uses | |
| ### Direct Use | |
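| This model maps a textual description of adversary behavior to the ID of a known threat group (for example, MITRE ATT&CK group `G0001`), using the `label_to_groupid.json` mapping shipped with the model. | |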
| ### Downstream Use | |
| Integration into Cyber Threat Intelligence platforms, SOC incident analysis tools, and automated threat detection systems. | |
| ### Out-of-Scope Use | |
| - General language tasks or any other task outside the cybersecurity domain | |
| ## Bias, Risks, and Limitations | |
| This model specializes in cybersecurity contexts. Predictions for unrelated contexts may be inaccurate. | |
| ### Recommendations | |
| Always have cybersecurity analysts verify predictions before using them in critical decision-making. | |
| ## How to Get Started with the Model (Classification) | |
```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the mapping from class index to group ID
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json",
)
with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned MPNet classifier
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "selfconstruct3d/AttackGroup-MPNET",
    num_labels=len(label_to_groupid),
).to(device)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")


def predict_group(sentence):
    """Return the predicted group ID for a single sentence."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt",
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)

    # Pick the highest-scoring class and translate it to a group ID
    predicted_label = torch.argmax(outputs.logits, dim=1).cpu().item()
    return label_to_groupid[str(predicted_label)]


# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
predicted_class = predict_group(sentence)
print(f"Predicted GroupID: {predicted_class}")
```
| Example output: `Predicted GroupID: G0001` (see https://attack.mitre.org/groups/G0001/). | |
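To support the analyst review recommended above, it can help to show the model's top few candidates with their softmax scores and to flag low-confidence predictions. The following is a minimal, framework-free sketch: the helper `top_k_groups`, the 0.5 review threshold, and the example logits and mapping are all hypothetical and not part of the released model. Softmax scores from a fine-tuned classifier are not guaranteed to be calibrated, so any threshold should be tuned on held-out data.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_groups(logits, label_to_groupid, k=3, review_threshold=0.5):
    """Return the k most likely group IDs and a flag for analyst review.

    `review_threshold` is a hypothetical cut-off, not a value from the card.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    candidates = [(label_to_groupid[str(i)], probs[i]) for i in ranked]
    needs_review = candidates[0][1] < review_threshold
    return candidates, needs_review


# Illustrative values only. With the real model, pass
# `outputs.logits[0].tolist()` and the mapping from label_to_groupid.json.
example_logits = [2.0, 0.5, 1.5, -1.0]
example_mapping = {"0": "G0001", "1": "G0002", "2": "G0003", "3": "G0004"}

candidates, needs_review = top_k_groups(example_logits, example_mapping, k=2)
for group_id, prob in candidates:
    print(f"{group_id}: {prob:.3f}")
print("Needs analyst review:", needs_review)
```

Returning several candidates instead of a single label gives an analyst alternatives to consider when the top score is low.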
| ## How to Get Started with the Model (Embeddings) | |
```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The label mapping is only needed here to size the classification head
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json",
)
with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned classification model and its tokenizer
model_name = "selfconstruct3d/AttackGroup-MPNET"
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label_to_groupid),
).to(device)


def get_embedding(sentence):
    """Return the embedding of the first token as a 1-D NumPy array."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt",
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        # Call the MPNet encoder directly, bypassing the classification head
        outputs = classifier_model.mpnet(input_ids=input_ids, attention_mask=attention_mask)

    # Take the hidden state at position 0 (the start-of-sequence token)
    cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()
    return cls_embedding


# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
embedding = get_embedding(sentence)
print("Embedding shape:", embedding.shape)
print("Embedding values:", embedding)
```
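Embeddings such as those returned by `get_embedding` are commonly compared with cosine similarity, for example to find reports that resemble a query. The sketch below is a minimal, dependency-free illustration; the helper names, the report names, and the 3-dimensional vectors are hypothetical stand-ins for real embeddings, and the card does not state which similarity measure the author used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query, corpus, top_n=2):
    """Rank corpus entries (name -> vector) by similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Toy vectors for illustration; real inputs would be `get_embedding(...)` outputs.
query_vec = [1.0, 0.0, 1.0]
corpus = {
    "report_a": [1.0, 0.0, 1.0],
    "report_b": [0.0, 1.0, 0.0],
    "report_c": [1.0, 1.0, 1.0],
}

for name, score in most_similar(query_vec, corpus):
    print(f"{name}: {score:.3f}")
```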
| ## Training Details | |
| ### Training Data | |
| To be announced. | |
| ### Training Procedure | |
| - Fine-tuned from: MLM fine-tuned MPNet ("mpnet_mlm_cyber_finetuned-v2") | |
| - Epochs: 32 | |
| - Learning rate: 5e-6 | |
| - Batch size: 16 | |
| ## Evaluation | |
| ### Testing Data, Factors & Metrics | |
| - **Testing Data:** Stratified sample from original dataset. | |
| - **Metrics:** Accuracy, Weighted F1 Score | |
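For reference, the two metrics named above can be computed as follows. This is a minimal, dependency-free sketch: the function names and the toy labels are illustrative, and the author's actual evaluation code is not part of this card. Weighted F1 here means the per-class F1 averaged with weights proportional to each class's number of true samples.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += f1 * n_true
    return total / len(y_true)


# Toy labels for illustration only
y_true = ["G0001", "G0001", "G0002", "G0003"]
y_pred = ["G0001", "G0002", "G0002", "G0003"]

print(f"Accuracy:    {accuracy(y_true, y_pred):.4f}")
print(f"Weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```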
| ### Results | |
| | Metric | Value | | |
| |--------------------------------|---------| | |
| | Classification accuracy (test) | 0.9564 | | |
| | Weighted F1 score (test) | 0.9577 | | |
| ## Evaluation Results | |
| | Model | Accuracy | F1 Macro | F1 Weighted | Embedding Variability | | |
| |-----------------------|----------|----------|-------------|-----------------------| | |
| | **AttackGroup-MPNET** | **0.85** | **0.759**| **0.847** | 0.234 | | |
| | GTE Large | 0.66 | 0.571 | 0.667 | 0.266 | | |
| | E5 Large v2 | 0.64 | 0.541 | 0.650 | 0.355 | | |
| | Original MPNet | 0.63 | 0.534 | 0.619 | 0.092 | | |
| | BGE Large | 0.53 | 0.418 | 0.519 | 0.366 | | |
| | SupSimCSE | 0.50 | 0.373 | 0.479 | 0.227 | | |
| | MLM Fine-tuned MPNet | 0.44 | 0.272 | 0.411 | 0.125 | | |
| | SecBERT | 0.41 | 0.315 | 0.410 | 0.591 | | |
| | SecureBERT_Plus | 0.36 | 0.252 | 0.349 | 0.267 | | |
| | CySecBERT | 0.34 | 0.235 | 0.323 | 0.229 | | |
| | ATTACK-BERT | 0.33 | 0.240 | 0.316 | 0.096 | | |
| | Secure_BERT | 0.00 | 0.000 | 0.000 | 0.007 | | |
| | CyBERT | 0.00 | 0.000 | 0.000 | 0.015 | | |
| | Model | Similarity Search Recall@5 | Few-shot Accuracy | In-dist Similarity | OOD Similarity | Robustness Similarity | | |
| |----------------------|----------------------------|-------------------|--------------------|----------------|-----------------------| | |
| | **AttackGroup-MPNET**| **0.934** | **0.857** | 0.235 | 0.017 | 0.948 | | |
| | Original MPNet | 0.786 | 0.643 | 0.217 | -0.004 | 0.941 | | |
| | E5 Large v2 | 0.778 | 0.679 | 0.727 | 0.013 | 0.977 | | |
| | GTE Large | 0.746 | 0.786 | 0.845 | 0.002 | 0.984 | | |
| | BGE Large | 0.632 | 0.750 | 0.533 | -0.006 | 0.970 | | |
| | SupSimCSE | 0.616 | 0.571 | 0.683 | -0.015 | 0.978 | | |
| | SecBERT | 0.468 | 0.429 | 0.586 | -0.001 | 0.970 | | |
| | CyBERT | 0.452 | 0.250 | 1.000 | -0.001 | 1.000 | | |
| | ATTACK-BERT | 0.362 | 0.571 | 0.157 | -0.005 | 0.950 | | |
| | CySecBERT | 0.424 | 0.500 | 0.734 | -0.015 | 0.954 | | |
| | Secure_BERT | 0.424 | 0.250 | 0.990 | 0.050 | 0.998 | | |
| | SecureBERT_Plus | 0.406 | 0.464 | 0.981 | 0.040 | 0.998 | | |
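The card does not define how Similarity Search Recall@5 was computed. One common reading for similarity search is the fraction of queries whose k nearest neighbours include at least one item with the same group label. The sketch below implements that reading under this assumption; `recall_at_k` and the toy neighbour lists are hypothetical, and the author's evaluation may differ.

```python
def recall_at_k(query_labels, ranked_neighbor_labels, k=5):
    """Fraction of queries with a same-label item among their top-k neighbours.

    `ranked_neighbor_labels[i]` holds the labels of the items retrieved for
    query i, ordered from most to least similar.
    """
    hits = 0
    for label, neighbors in zip(query_labels, ranked_neighbor_labels):
        if label in neighbors[:k]:
            hits += 1
    return hits / len(query_labels)


# Toy data for illustration only
query_labels = ["G0001", "G0002", "G0003"]
ranked_neighbor_labels = [
    ["G0002", "G0001", "G0003"],
    ["G0003", "G0001", "G0002"],
    ["G0003", "G0001", "G0002"],
]

print(f"Recall@2: {recall_at_k(query_labels, ranked_neighbor_labels, k=2):.3f}")
print(f"Recall@3: {recall_at_k(query_labels, ranked_neighbor_labels, k=3):.3f}")
```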
| ## Environmental Impact | |
| Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute). | |
| - **Hardware Type:** [To be filled by user] | |
| - **Hours used:** [To be filled by user] | |
| - **Cloud Provider:** [To be filled by user] | |
| - **Compute Region:** [To be filled by user] | |
| - **Carbon Emitted:** [To be filled by user] | |
| ## Technical Specifications | |
| ### Model Architecture | |
| - MPNet architecture with classification head (768 -> 512 -> num_labels) | |
| - Last 10 transformer layers fine-tuned | |
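Fine-tuning only the last layers is usually done by setting `requires_grad` per parameter. The sketch below shows one way to decide this from parameter names. It rests on several assumptions not confirmed by the card: a 12-layer encoder whose parameters are named `...encoder.layer.<index>...`, a head whose parameter names start with `classifier`, and frozen embeddings. The helper `is_trainable` is hypothetical.

```python
import re

# Matches the layer index in names such as "mpnet.encoder.layer.11.output.dense.weight"
LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name, num_layers=12, num_trainable_layers=10):
    """Decide whether a parameter should receive gradient updates.

    Assumptions (not from the card): 12 encoder layers, head parameters
    prefixed with "classifier", everything else (e.g. embeddings) frozen.
    """
    match = LAYER_PATTERN.search(param_name)
    if match is None:
        return param_name.startswith("classifier")
    first_trainable = num_layers - num_trainable_layers
    return int(match.group(1)) >= first_trainable


# Illustrative parameter names
for name in [
    "mpnet.embeddings.word_embeddings.weight",
    "mpnet.encoder.layer.1.output.dense.weight",
    "mpnet.encoder.layer.2.output.dense.weight",
    "classifier.dense.weight",
]:
    print(f"{name}: trainable={is_trainable(name)}")
```

With a loaded PyTorch model, the helper could be applied as `for name, param in model.named_parameters(): param.requires_grad = is_trainable(name)`.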
| ## Model Card Authors | |
| - Dženan Hamzić | |
| ## Model Card Contact | |
| - https://www.linkedin.com/in/dzenan-hamzic/ | |
| ## Licence | |
| This model is licensed for non-commercial use only (CC BY-NC 4.0). | |
| For commercial inquiries, please contact dzenan.hamzic@ait.ac.at. |