
Qwen3.5-9B-abliterated-v2-MAX-FP8

Qwen3.5-9B-abliterated-v2-MAX-FP8 is an FP8-compressed variant of prithivMLmods/Qwen3.5-9B-abliterated-v2-MAX. It uses BF16 · FP8 (F8_E4M3) precision formats to significantly reduce memory footprint and improve inference efficiency. The model retains the character of its base while applying a more optimized abliteration rate: refined refusal-direction analysis is combined with an enhanced training strategy to further minimize internal refusal behaviors while preserving strong reasoning and instruction-following capabilities. The result is a capable 9B-parameter language model optimized for detailed responses and improved instruction adherence, now with better deployment efficiency.

This model is intended strictly for research and learning purposes. Due to reduced internal refusal mechanisms, it may generate sensitive or unrestricted content. Users assume full responsibility for how the model is used. The authors and hosting platform disclaim any liability for generated outputs.

Key Highlights

  • FP8 Compression (F8_E4M3): Reduces VRAM usage and improves inference throughput while maintaining strong output quality.
  • BF16 · FP8 Hybrid Precision: Balances numerical stability and performance across model layers.
  • Optimized Abliteration Rate (v2): Improved suppression of refusal directions with better balance between openness and coherence.
  • Advanced Refusal Direction Analysis: Identifies and mitigates refusal-related activations within the model’s latent space.
  • Abliterated v2 Training Strategy: Further reduces refusal behaviors while maintaining response quality and consistency.
  • 9B Parameter Architecture: Based on Qwen3.5-9B, offering strong reasoning with efficient deployment.
  • Improved Instruction Adherence: Better handling of complex and nuanced prompts with minimal unnecessary refusals.
  • Efficient Deployment: Ideal for local inference and research workflows with reduced hardware requirements.
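To make the "refusal direction" idea above concrete, here is a minimal, illustrative sketch of directional ablation: assuming a unit refusal direction r has already been estimated (e.g. from the difference of mean activations on refused vs. answered prompts), its component is projected out of a weight matrix's output space. The function names and shapes are hypothetical and do not describe the actual pipeline used for this model.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's outputs along direction r.

    W: (d_out, d_in) weight matrix; r: (d_out,) direction vector.
    Returns W' = (I - r r^T) W, so W' @ x has no component along r.
    """
    r = r / np.linalg.norm(r)          # ensure unit norm
    return W - np.outer(r, r @ W)      # subtract the rank-1 projection

# Toy check: after ablation, outputs are orthogonal to r.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
r = rng.standard_normal(4)
W_abl = ablate_direction(W, r)
x = rng.standard_normal(3)
print(abs((r / np.linalg.norm(r)) @ (W_abl @ x)))  # ~0.0
```

Applied across the layers that carry the refusal signal, this kind of projection is what "suppressing refusal directions" refers to.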

Quick Start with Transformers

pip install transformers==5.4.0
# or
pip install git+https://github.com/huggingface/transformers.git

from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen3.5-9B-abliterated-v2-MAX-FP8",
    torch_dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Qwen3.5-9B-abliterated-v2-MAX-FP8"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain how transformer models work in simple terms."}
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(output_text)
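As a rough back-of-the-envelope sanity check (not a measured figure), FP8 weight storage roughly halves the footprint of BF16 for the same parameter count; this ignores the KV cache, activations, and any layers kept in BF16:

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage only; excludes KV cache and activations."""
    return n_params * bytes_per_param / 1024**3

n = 9e9  # ~9B parameters
bf16 = weight_memory_gib(n, 2.0)  # BF16: 2 bytes per parameter
fp8 = weight_memory_gib(n, 1.0)   # FP8 (E4M3): 1 byte per parameter
print(f"BF16 ~ {bf16:.1f} GiB, FP8 ~ {fp8:.1f} GiB")
# → BF16 ~ 16.8 GiB, FP8 ~ 8.4 GiB
```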

Intended Use

  • Alignment & Refusal Research: Studying abliteration effects under FP8 compression.
  • Red-Teaming Experiments: Evaluating robustness across adversarial prompts.
  • Efficient Local Deployment: Running 9B-class models with reduced VRAM usage.
  • Research Prototyping: Exploring trade-offs between compression, alignment, and reasoning.
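For refusal research and red-teaming, one simple (and admittedly crude) starting point is counting stock refusal phrases in model outputs. The marker list and function names below are illustrative assumptions, not part of this model's release:

```python
# Hypothetical phrase list; real studies should use a larger, validated set.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i'm sorry", "as an ai", "i am unable", "i won't",
)

def looks_like_refusal(text: str) -> bool:
    """Heuristic: does the response open with a stock refusal phrase?"""
    head = text.strip().lower()[:80]
    return any(m in head for m in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        return 0.0
    return sum(looks_like_refusal(r) for r in responses) / len(responses)

# Toy usage with canned outputs:
outs = ["I'm sorry, but I can't help with that.", "Sure, here is an overview..."]
print(refusal_rate(outs))  # → 0.5
```

Comparing this rate between the base model and the abliterated variant on the same prompt set gives a first-order measure of the abliteration effect.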

Limitations & Risks

Important Note: This model intentionally minimizes built-in safety refusals.

  • High Risk of Sensitive Outputs: May generate unrestricted or controversial responses.
  • User Responsibility: Must be used in a safe, ethical, and lawful manner.
  • Precision Trade-offs: FP8 may introduce minor instability in edge cases.
  • Abliteration Trade-offs: Increased openness may affect safety alignment or consistency.
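The FP8 precision trade-off can be made tangible with a simplified E4M3 rounding simulation (3 mantissa bits, maximum representable magnitude 448). This sketch omits subnormals and NaN handling and is not bit-exact; it only illustrates why edge cases near the format's range and precision limits can behave differently than BF16:

```python
import math

def fake_e4m3(v: float) -> float:
    """Round v to a simulated FP8 E4M3 value (3 mantissa bits, max 448).

    Illustrative only: subnormals and NaN handling are omitted.
    """
    if v == 0.0:
        return 0.0
    sign = math.copysign(1.0, v)
    mag = min(abs(v), 448.0)       # saturate at the E4M3 maximum
    e = math.floor(math.log2(mag))
    step = 2.0 ** (e - 3)          # 3 mantissa bits -> 8 steps per octave
    return sign * round(mag / step) * step

# Relative rounding error for in-range values stays within ~1/16;
# out-of-range values saturate hard at 448.
for v in (0.1, 1.234, 57.0, 300.0, 1000.0):
    print(v, "->", fake_e4m3(v))
```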

Model Tree

Base model: Qwen/Qwen3.5-9B, finetuned as prithivMLmods/Qwen3.5-9B-abliterated-v2-MAX, then quantized to this model.