A professional hate speech detection system using machine learning with multiple model architectures and comprehensive evaluation tools.
- Multiple Model Types: Random Forest, Logistic Regression, and SVM classifiers
- Enhanced Dataset Generation: Creates synthetic and contextual data for better training
- Professional Text Preprocessing: Advanced NLP techniques for text cleaning
- Comprehensive Evaluation: Detailed performance analysis and visualization
- Easy-to-Use Interface: Command-line tools for training and prediction
- Production Ready: Save/load models for deployment
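To make the text-cleaning feature above more concrete, here is a minimal sketch of the kind of normalization a tweet classifier often applies. The function name `clean_text` and the specific steps (lowercasing, removing URLs and @mentions, stripping punctuation) are assumptions for illustration; the actual logic in `preprocessor.py` may differ.

```python
import re

# Patterns are illustrative assumptions, not taken from preprocessor.py.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase, drop URLs and @mentions, strip punctuation, collapse spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)  # also removes the '#' of hashtags
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_text("RT @user: Check THIS out!! https://example.com #Wow"))
# prints: rt check this out wow
```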
pip install pandas numpy scikit-learn matplotlib seaborn nltk textblob
python simple_model.py
# Interactive mode
python simple_predict.py --interactive
# Single text prediction
python simple_predict.py --text "Your text here"
# Quick demo with examples
python quick_demo.py
Python/HateSpeech/
├── requirements.txt # Dependencies
├── simple_model.py # Scikit-learn based models
├── simple_predict.py # Prediction interface
├── quick_demo.py # Demo script
├── data_generator.py # Dataset generation (TensorFlow version)
├── preprocessor.py # Text preprocessing utilities
├── model.py # TensorFlow models (advanced)
├── train.py # TensorFlow training pipeline
├── predict.py # TensorFlow prediction interface
├── evaluate.py # Model evaluation and analysis
├── README.md # This file
├── dataset_tweet.csv # Original dataset
├── simple_hate_speech_detector.pkl # Trained scikit-learn model
└── enhanced_dataset.csv # Enhanced dataset (generated)
The current scikit-learn model achieves:
- Overall Accuracy: 89.08%
- Hate Speech Detection: 50% precision, 13% recall
- Offensive Language Detection: 91% precision, 96% recall
- Neutral Text Detection: 81% precision, 85% recall
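The gap between the overall accuracy and the hate speech recall above is the kind of pattern that typically shows up when one class is much rarer than the others. The sketch below computes accuracy and per-class precision and recall in plain Python. The labels are made-up toy data, not the project's test set, and `per_class_metrics` is a hypothetical helper rather than part of `evaluate.py`.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return accuracy and {label: (precision, recall)} for two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed an instance of the true label
    accuracy = sum(tp.values()) / len(y_true)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        report[label] = (precision, recall)
    return accuracy, report


# Toy, imbalanced labels: 16 "offensive" items and only 4 "hate" items.
y_true = ["offensive"] * 16 + ["hate"] * 4
y_pred = ["offensive"] * 16 + ["hate"] + ["offensive"] * 3

accuracy, report = per_class_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f}")
for label, (precision, recall) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case accuracy is 0.85 even though only one of the four minority-class items is found, which is why per-class recall is worth reporting alongside accuracy.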
from simple_model import SimpleHateSpeechDetector
# Random Forest (default)
detector = SimpleHateSpeechDetector(model_type='random_forest')
# Logistic Regression
detector = SimpleHateSpeechDetector(model_type='logistic_regression')
# Support Vector Machine
detector = SimpleHateSpeechDetector(model_type='svm')

from simple_model import SimpleHateSpeechDetector
# Load trained model
detector = SimpleHateSpeechDetector()
detector.load_model('simple_hate_speech_detector.pkl')
# Single prediction
result = detector.predict_single("Your text here")
print(f"Class: {result['predicted_class']}")
print(f"Confidence: {result['confidence']:.4f}")
# Batch prediction
texts = ["Text 1", "Text 2", "Text 3"]
predictions, probabilities = detector.predict(texts)

The system classifies text into three categories:
- Hate Speech: Content that promotes violence or discrimination against groups
- Offensive Language: Profanity, insults, or inappropriate content
- Neither: Normal, non-offensive content
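As a rough illustration of how a row of class probabilities could be turned into one of the three categories above, the sketch below picks the highest-probability class. The class order in `CLASS_NAMES` and the helper `top_class` are assumptions; check how `simple_model.py` actually encodes its labels before relying on this.

```python
# Assumed label order; the real encoding in simple_model.py may differ.
CLASS_NAMES = ["Hate Speech", "Offensive Language", "Neither"]


def top_class(probabilities, class_names=CLASS_NAMES):
    """Return (label, probability) for the most likely class."""
    if len(probabilities) != len(class_names):
        raise ValueError("expected one probability per class")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]


label, confidence = top_class([0.05, 0.90, 0.05])
print(f"{label} ({confidence:.1%})")
# prints: Offensive Language (90.0%)
```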
| Text | Prediction | Confidence |
|---|---|---|
| "I love this beautiful day!" | Neither | 50.5% |
| "Fuck you, you piece of shit" | Offensive Language | 90.0% |
| "Kill all the Jews" | Hate Speech | 67.1% |
| "The weather is nice today" | Neither | 69.9% |
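Some of the confidences in the table are close to 50%, so a caller may want to treat low-confidence predictions differently, for example by sending them to a human reviewer. The sketch below shows one way to do that. The 0.60 threshold and the helper `needs_review` are hypothetical and would need tuning on real validation data.

```python
REVIEW_THRESHOLD = 0.60  # hypothetical cut-off, not taken from the project


def needs_review(confidence, threshold=REVIEW_THRESHOLD):
    """Return True when a prediction is too uncertain to act on directly."""
    return confidence < threshold


# Two rows from the table above, with confidence expressed as a fraction.
rows = [
    ("I love this beautiful day!", "Neither", 0.505),
    ("The weather is nice today", "Neither", 0.699),
]

flagged = [text for text, _label, conf in rows if needs_review(conf)]
print(flagged)
# prints: ['I love this beautiful day!']
```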
For more advanced models using TensorFlow:
- Install TensorFlow:
pip install tensorflow
- Use the advanced training pipeline:
python train.py
- Advanced prediction:
python predict.py --interactive

- Random Forest: Ensemble method with good generalization
- Logistic Regression: Linear model with interpretable results
- Support Vector Machine: Effective for high-dimensional data
- LSTM: Bidirectional LSTM for sequential text understanding
- CNN: Convolutional layers for local pattern detection
- Transformer: Multi-head attention for capturing long-range context
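For the three scikit-learn options in the list above, a `model_type` switch could be wired up roughly as follows. This is a sketch under assumptions: the TF-IDF features, the hyperparameters, the use of `LinearSVC` for the SVM option, and the function name `build_pipeline` are illustrative and not taken from `simple_model.py`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Factories so that each call builds a fresh, unfitted estimator.
CLASSIFIERS = {
    "random_forest": lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "svm": lambda: LinearSVC(),  # note: LinearSVC has no predict_proba
}


def build_pipeline(model_type="random_forest"):
    """Return an unfitted TF-IDF + classifier pipeline for the given type."""
    if model_type not in CLASSIFIERS:
        raise ValueError(f"unknown model_type: {model_type!r}")
    return make_pipeline(TfidfVectorizer(), CLASSIFIERS[model_type]())


# Tiny made-up example, only to show the fit/predict flow.
texts = ["good day", "nice weather", "awful rude insult", "rude nasty insult"]
labels = [0, 0, 1, 1]
pipeline = build_pipeline("logistic_regression")
pipeline.fit(texts, labels)
print(pipeline.predict(["nice day"]))
```

Keeping the vectorizer and classifier in one pipeline means the same preprocessing is applied at training and prediction time, and the whole object can be saved as a single file.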
The system includes data generation capabilities:
- Synthetic Data: Generated using hate speech, offensive, and neutral phrase templates
- Contextual Data: Context-aware hate speech patterns with demographic groups
- Original Data: Twitter dataset with manual annotations
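Template-based generation like the synthetic data described above can be sketched with simple slot filling. The templates, slot values, and function names below are hypothetical and cover only the neutral class; the same mechanism would apply to the other classes using the project's own curated phrase lists, which are not reproduced here.

```python
import itertools
import random

# Hypothetical neutral templates and slot values, for illustration only.
NEUTRAL_TEMPLATES = [
    "The {noun} is {adjective} today",
    "I really enjoyed the {noun}",
]
SLOTS = {
    "noun": ["weather", "concert", "lecture"],
    "adjective": ["nice", "quiet"],
}


def expand(template, slots):
    """Yield every way of filling the slots that appear in the template."""
    names = [name for name in slots if "{" + name + "}" in template]
    for values in itertools.product(*(slots[name] for name in names)):
        yield template.format(**dict(zip(names, values)))


def generate(templates, slots, label, seed=0):
    """Return shuffled (text, label) rows built from all templates."""
    rows = [(text, label) for t in templates for text in expand(t, slots)]
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    return rows


rows = generate(NEUTRAL_TEMPLATES, SLOTS, label="neither")
print(len(rows))
# prints: 9
```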
- Use GPU acceleration for TensorFlow models
- Adjust batch size based on available memory
- Use early stopping to prevent overfitting
- Experiment with different model architectures
- Memory Issues: Reduce vocabulary size or batch size
- Slow Training: Use smaller models or fewer epochs
- Poor Performance: Increase dataset size or try different model architecture
- NLTK Errors: Ensure NLTK data is downloaded
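The early-stopping and "fewer epochs" advice above comes down to one rule: stop once the validation loss has not improved for a set number of epochs. The sketch below shows that rule in plain Python on a made-up loss curve. The function name and the `patience` and `min_delta` defaults are illustrative. Keras provides the same idea as the `tf.keras.callbacks.EarlyStopping` callback, though whether `train.py` uses it is not stated here.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    waited = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs


# Made-up validation losses: improvement stalls after epoch 2.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64]
print(early_stopping_epoch(losses, patience=2))
# prints: 4
```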
If you encounter issues with TensorFlow installation:
- Use the scikit-learn version (simple_model.py)
- Install TensorFlow CPU version:
pip install tensorflow-cpu
- Use conda for better dependency management
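If a wrapper script needs to fall back to the scikit-learn version automatically, one option is to check whether TensorFlow is importable before choosing which script to run. This is a minimal sketch; `is_installed` and `pick_script` are hypothetical helpers, not part of the project.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_script() -> str:
    """Choose the prediction script based on whether TensorFlow is available."""
    return "predict.py" if is_installed("tensorflow") else "simple_predict.py"


print(pick_script())
```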
To enhance the system:
- Add new model architectures in simple_model.py
- Implement additional preprocessing techniques in preprocessor.py
- Create new dataset generators in data_generator.py
- Add evaluation metrics in evaluate.py
This project is for educational and research purposes. Please ensure compliance with data usage and privacy regulations when deploying in production environments.
- ✅ Working Implementation: Scikit-learn based hate speech detector
- ✅ Trained Model: 89% accuracy on test data
- ✅ Interactive Interface: Command-line prediction tool
- ✅ Demo Script: Quick testing with example texts
- 🔄 Advanced Models: TensorFlow implementation available (requires TensorFlow installation)