	---
	library_name: transformers
	tags:
	- cybersecurity
	- mpnet
	- classification
	- fine-tuned
	language:
	- en
	base_model:
	- sentence-transformers/all-mpnet-base-v2
	---

	# AttackGroup-MPNET - Model Card for MPNet Cybersecurity Classifier

	This is a fine-tuned MPNet model specialized for classifying cybersecurity threat groups based on textual descriptions of their tactics and techniques.

	## Model Details

	### Model Description

	This model is a fine-tuned MPNet classifier specialized in categorizing cybersecurity threat groups based on textual descriptions of their tactics, techniques, and procedures (TTPs).

	- Developed by: Dženan Hamzić
	- Model type: Transformer-based classification model (MPNet)
	- Language(s) (NLP): English
	- License: Apache-2.0
	- Finetuned from model: microsoft/mpnet-base (with intermediate MLM fine-tuning)

	### Model Sources

	- Base Model: [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base)

	## Uses

	### Direct Use

	This model classifies textual cybersecurity descriptions into known cybersecurity threat groups.

	### Downstream Use

	Integration into Cyber Threat Intelligence platforms, SOC incident analysis tools, and automated threat detection systems.

	### Out-of-Scope Use

	- General language tasks or any other task outside the cybersecurity domain

	## Bias, Risks, and Limitations

	This model specializes in cybersecurity contexts. Predictions for unrelated contexts may be inaccurate.

	### Recommendations

	Always have cybersecurity analysts verify predictions before using them in critical decision-making scenarios.
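One way to support that review step is to attach a confidence score to each prediction and flag uncertain ones for an analyst. The sketch below is an illustration only: it works on a plain list of logits (such as `outputs.logits[0].tolist()` from the classifier), and the `triage` helper and the `0.5` threshold are hypothetical choices, not part of the released model.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits, threshold=0.5):
    """Return (label_index, confidence, needs_review).

    `threshold` is an assumed cut-off; tune it on held-out data.
    """
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[label]
    return label, confidence, confidence < threshold


# Toy logits stand in for real model output.
label, confidence, needs_review = triage([2.0, 0.1, -1.0])
print(label, round(confidence, 3), needs_review)
```

Predictions with `needs_review` set to `True` can then be routed to a human analyst instead of being acted on automatically.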

	## How to Get Started with the Model (Classification)

	```python
	import json

	import torch
	from huggingface_hub import hf_hub_download
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Download the mapping from class index to MITRE ATT&CK group ID
	label_to_groupid_file = hf_hub_download(
	    repo_id="selfconstruct3d/AttackGroup-MPNET",
	    filename="label_to_groupid.json",
	)

	with open(label_to_groupid_file, "r") as f:
	    label_to_groupid = json.load(f)

	# Load the fine-tuned MPNet classifier
	classifier_model = AutoModelForSequenceClassification.from_pretrained(
	    "selfconstruct3d/AttackGroup-MPNET",
	    num_labels=len(label_to_groupid),
	).to(device)

	# Load the matching tokenizer
	tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")


	def predict_group(sentence):
	    """Return the predicted group ID for a single sentence."""
	    classifier_model.eval()
	    encoding = tokenizer(
	        sentence,
	        truncation=True,
	        padding="max_length",
	        max_length=128,
	        return_tensors="pt",
	    )
	    input_ids = encoding["input_ids"].to(device)
	    attention_mask = encoding["attention_mask"].to(device)

	    with torch.no_grad():
	        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
	        logits = outputs.logits
	        predicted_label = torch.argmax(logits, dim=1).cpu().item()

	    # JSON keys are strings, so convert the class index before the lookup
	    return label_to_groupid[str(predicted_label)]


	# Example usage
	sentence = "APT38 has used phishing emails with malicious links to distribute malware."
	predicted_class = predict_group(sentence)
	print(f"Predicted GroupID: {predicted_class}")
	```
	Example output: `Predicted GroupID: G0001` (see https://attack.mitre.org/groups/G0001/)
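The returned ID can be turned into the corresponding MITRE ATT&CK page link, following the URL pattern shown above. This is a small sketch; the `group_url` helper is hypothetical, and it assumes every ID in `label_to_groupid.json` has the form `G` followed by four digits.

```python
import re

MITRE_GROUP_URL = "https://attack.mitre.org/groups/{group_id}/"


def group_url(group_id):
    """Build the MITRE ATT&CK page URL for an ID such as 'G0001'."""
    # Assumed ID format: 'G' plus four digits.
    if not re.fullmatch(r"G\d{4}", group_id):
        raise ValueError(f"Unexpected group ID format: {group_id!r}")
    return MITRE_GROUP_URL.format(group_id=group_id)


print(group_url("G0001"))
```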


	## How to Get Started with the Model (Embeddings)

	```python
	import json

	import torch
	from huggingface_hub import hf_hub_download
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# The label mapping is only needed to size the classification head on load
	label_to_groupid_file = hf_hub_download(
	    repo_id="selfconstruct3d/AttackGroup-MPNET",
	    filename="label_to_groupid.json",
	)

	with open(label_to_groupid_file, "r") as f:
	    label_to_groupid = json.load(f)

	# Load the fine-tuned classification model and its tokenizer
	model_name = "selfconstruct3d/AttackGroup-MPNET"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	classifier_model = AutoModelForSequenceClassification.from_pretrained(
	    model_name,
	    num_labels=len(label_to_groupid),
	).to(device)


	def get_embedding(sentence):
	    """Return the first-token (CLS-position) embedding as a 1-D NumPy array."""
	    classifier_model.eval()

	    encoding = tokenizer(
	        sentence,
	        truncation=True,
	        padding="max_length",
	        max_length=128,
	        return_tensors="pt",
	    )
	    input_ids = encoding["input_ids"].to(device)
	    attention_mask = encoding["attention_mask"].to(device)

	    with torch.no_grad():
	        # Run only the MPNet encoder, skipping the classification head
	        outputs = classifier_model.mpnet(input_ids=input_ids, attention_mask=attention_mask)
	        cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()

	    return cls_embedding


	# Example usage
	sentence = "APT38 has used phishing emails with malicious links to distribute malware."
	embedding = get_embedding(sentence)
	print("Embedding shape:", embedding.shape)
	print("Embedding values:", embedding)
	```
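Embeddings like these are typically compared with cosine similarity, for example to find the stored reports closest to a new description. The following is a minimal sketch under assumptions: `cosine_similarity` and `most_similar` are hypothetical helpers, and the three-dimensional toy vectors stand in for real vectors returned by `get_embedding`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, corpus, k=5):
    """Return the k (name, score) pairs from `corpus` closest to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors; in practice each value would come from get_embedding(text).
corpus = {
    "report_a": [1.0, 0.0, 0.0],
    "report_b": [0.6, 0.8, 0.0],
    "report_c": [0.0, 0.0, 1.0],
}
results = most_similar([1.0, 0.1, 0.0], corpus, k=2)
print(results)
```

The helpers accept plain lists as well as 1-D NumPy arrays, since they only iterate over the values.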



	## Training Details

	### Training Data

	To be announced.

	### Training Procedure

	- Fine-tuned from: MLM fine-tuned MPNet ("mpnet_mlm_cyber_finetuned-v2")
	- Epochs: 32
	- Learning rate: 5e-6
	- Batch size: 16
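To make the scale of this schedule concrete, the hyperparameters above can be collected in a small config object that also derives the number of optimizer steps. This is a sketch with assumptions: the `TrainConfig` class is hypothetical, the dataset size of 10,000 is a placeholder (the training data is not yet published), and it assumes no gradient accumulation and that the final partial batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in the training procedure."""
    epochs: int = 32
    learning_rate: float = 5e-6
    batch_size: int = 16

    def total_steps(self, num_examples):
        """Optimizer steps over the full run, one step per batch."""
        steps_per_epoch = math.ceil(num_examples / self.batch_size)
        return steps_per_epoch * self.epochs


config = TrainConfig()
# 10_000 is a placeholder dataset size, not a figure from this model card.
print(config.total_steps(num_examples=10_000))
```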

	## Evaluation

	### Testing Data, Factors & Metrics

	- Testing Data: Stratified sample from original dataset.
	- Metrics: Accuracy, Weighted F1 Score

	### Results

	| Metric | Value |
	|--------------------------------|--------|
	| Classification Accuracy (Test) | 0.9564 |
	| Weighted F1 Score (Test) | 0.9577 |
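For readers unfamiliar with the two metrics, the sketch below computes accuracy and support-weighted F1 from scratch. It is an illustration of the standard definitions, not the evaluation script behind the numbers above, and the group IDs in the toy labels are arbitrary examples.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * count / total
    return score


# Toy labels for illustration only.
y_true = ["G0001", "G0001", "G0082", "G0082", "G0032"]
y_pred = ["G0001", "G0082", "G0082", "G0082", "G0032"]
print(accuracy(y_true, y_pred), round(weighted_f1(y_true, y_pred), 4))
```

Weighted F1 differs from macro F1 in that frequent classes contribute more to the average, which matters when threat groups are unevenly represented.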


	## Evaluation Results

	| Model | Accuracy | F1 Macro | F1 Weighted | Embedding Variability |
	|----------------------|----------|----------|-------------|-----------------------|
	| AttackGroup-MPNET | 0.85 | 0.759 | 0.847 | 0.234 |
	| GTE Large | 0.66 | 0.571 | 0.667 | 0.266 |
	| E5 Large v2 | 0.64 | 0.541 | 0.650 | 0.355 |
	| Original MPNet | 0.63 | 0.534 | 0.619 | 0.092 |
	| BGE Large | 0.53 | 0.418 | 0.519 | 0.366 |
	| SupSimCSE | 0.50 | 0.373 | 0.479 | 0.227 |
	| MLM Fine-tuned MPNet | 0.44 | 0.272 | 0.411 | 0.125 |
	| SecBERT | 0.41 | 0.315 | 0.410 | 0.591 |
	| SecureBERT_Plus | 0.36 | 0.252 | 0.349 | 0.267 |
	| CySecBERT | 0.34 | 0.235 | 0.323 | 0.229 |
	| ATTACK-BERT | 0.33 | 0.240 | 0.316 | 0.096 |
	| Secure_BERT | 0.00 | 0.000 | 0.000 | 0.007 |
	| CyBERT | 0.00 | 0.000 | 0.000 | 0.015 |


	| Model | Similarity Search Recall@5 | Few-shot Accuracy | In-dist Similarity | OOD Similarity | Robustness Similarity |
	|-------------------|----------------------------|-------------------|--------------------|----------------|-----------------------|
	| AttackGroup-MPNET | 0.934 | 0.857 | 0.235 | 0.017 | 0.948 |
	| Original MPNet | 0.786 | 0.643 | 0.217 | -0.004 | 0.941 |
	| E5 Large v2 | 0.778 | 0.679 | 0.727 | 0.013 | 0.977 |
	| GTE Large | 0.746 | 0.786 | 0.845 | 0.002 | 0.984 |
	| BGE Large | 0.632 | 0.750 | 0.533 | -0.006 | 0.970 |
	| SupSimCSE | 0.616 | 0.571 | 0.683 | -0.015 | 0.978 |
	| SecBERT | 0.468 | 0.429 | 0.586 | -0.001 | 0.970 |
	| CyBERT | 0.452 | 0.250 | 1.000 | -0.001 | 1.000 |
	| ATTACK-BERT | 0.362 | 0.571 | 0.157 | -0.005 | 0.950 |
	| CySecBERT | 0.424 | 0.500 | 0.734 | -0.015 | 0.954 |
	| Secure_BERT | 0.424 | 0.250 | 0.990 | 0.050 | 0.998 |
	| SecureBERT_Plus | 0.406 | 0.464 | 0.981 | 0.040 | 0.998 |
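The model card does not define how Recall@5 was computed. One common reading, assumed here, is the share of queries for which at least one of the top-k retrieved neighbours carries the query's own group label. The `recall_at_k` helper and the toy labels below are hypothetical and only illustrate that reading.

```python
def recall_at_k(query_labels, retrieved_labels, k=5):
    """Share of queries whose top-k neighbours include the query's own label.

    `retrieved_labels[i]` lists neighbour labels for query i, best match first.
    """
    hits = sum(
        label in neighbours[:k]
        for label, neighbours in zip(query_labels, retrieved_labels)
    )
    return hits / len(query_labels)


# Toy data: the first query finds its own group in the top 2, the second does not.
query_labels = ["G0001", "G0082"]
retrieved = [
    ["G0032", "G0001", "G0007"],
    ["G0032", "G0007", "G0001"],
]
score = recall_at_k(query_labels, retrieved, k=2)
print(score)
```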



	### Single Prediction Example

	```python
	import json

	import torch
	from huggingface_hub import hf_hub_download
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Load the fine-tuned MPNet classifier
	classifier_model = AutoModelForSequenceClassification.from_pretrained(
	    "selfconstruct3d/AttackGroup-MPNET"
	).to(device)

	# Load the matching tokenizer
	tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")

	# Download the mapping from class index to MITRE ATT&CK group ID
	label_to_groupid_file = hf_hub_download(
	    repo_id="selfconstruct3d/AttackGroup-MPNET",
	    filename="label_to_groupid.json",
	)

	with open(label_to_groupid_file, "r") as f:
	    label_to_groupid = json.load(f)


	def predict_group(sentence):
	    """Return the predicted group ID for a single sentence."""
	    classifier_model.eval()
	    encoding = tokenizer(
	        sentence,
	        truncation=True,
	        padding="max_length",
	        max_length=128,
	        return_tensors="pt",
	    )
	    input_ids = encoding["input_ids"].to(device)
	    attention_mask = encoding["attention_mask"].to(device)

	    with torch.no_grad():
	        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
	        logits = outputs.logits
	        predicted_label = torch.argmax(logits, dim=1).cpu().item()

	    # JSON keys are strings, so convert the class index before the lookup
	    return label_to_groupid[str(predicted_label)]


	# Example usage
	sentence = "APT38 has used phishing emails with malicious links to distribute malware."
	predicted_class = predict_group(sentence)
	print(f"Predicted GroupID: {predicted_class}")
	```

	## Environmental Impact

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

	- Hardware Type: [To be filled by user]
	- Hours used: [To be filled by user]
	- Cloud Provider: [To be filled by user]
	- Compute Region: [To be filled by user]
	- Carbon Emitted: [To be filled by user]

	## Technical Specifications

	### Model Architecture

	- MPNet architecture with classification head (768 -> 512 -> num_labels)
	- The last 10 transformer layers were fine-tuned
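As a rough sense of scale, the sketch below counts the parameters in a two-layer head of the stated shape and lists which encoder layers a "last 10" scheme would leave trainable. Several details are assumptions: the head is taken to be two plain linear layers with biases, `num_labels=100` is a placeholder, and the encoder is taken to have 12 layers as in `microsoft/mpnet-base`.

```python
def head_param_count(hidden=768, inner=512, num_labels=2):
    """Parameters in two linear layers (weights plus biases): hidden -> inner -> num_labels."""
    first = hidden * inner + inner
    second = inner * num_labels + num_labels
    return first + second


def trainable_layer_indices(num_layers=12, last_n=10):
    """Zero-based indices of the encoder layers that stay trainable."""
    return list(range(num_layers - last_n, num_layers))


# 100 labels is a placeholder; the real count is len(label_to_groupid).
print(head_param_count(num_labels=100))
print(trainable_layer_indices())
```

Under these assumptions only the first two encoder layers would stay frozen; whether the embedding layer was also frozen is not stated in this model card.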

	## Model Card Authors

	- Dženan Hamzić

	## Model Card Contact

	- https://www.linkedin.com/in/dzenan-hamzic/


	## License
	This model is licensed for non-commercial use only (CC BY-NC 4.0).
	For commercial inquiries, please contact dzenan.hamzic@ait.ac.at.