Darksight is a dark web monitoring system that scrapes .onion links, analyzes the content using advanced keyword classification, and categorizes websites as illicit or legal. The project leverages machine learning models for classification and provides an interactive interface for users to view results.
- Dark Web Scraping: Securely fetch .onion web pages using a proxy server configured with the Tor browser on Kali Linux.
- Keyword Analysis: Extract keywords from the fetched web pages and analyze their meaning.
- Illicit vs. Legal Classification: Use a fine-tuned BERT model to classify the content of the websites based on scraped keywords.
- Interactive Frontend: Visualize results and interact with the system through a Streamlit-powered interface.
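The scraping step above can be sketched in a few lines, assuming a Tor SOCKS proxy listening on the default port 9050; the helper name, port, and the placeholder URL below are illustrative assumptions, not part of the project:

```python
import requests

# Assumes Tor exposes a SOCKS5 proxy on localhost:9050 (its default).
# "socks5h" (rather than "socks5") makes DNS resolution happen inside Tor,
# which is required for .onion addresses to resolve.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion_page(url: str, timeout: int = 60) -> str:
    """Fetch a .onion page through the Tor SOCKS proxy and return its HTML."""
    response = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Note that `requests` needs the `PySocks` extra (`pip install requests[socks]`) for SOCKS proxies to work.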
- Python: Core programming language for backend development.
- BERT (Hugging Face): Pre-trained language model fine-tuned for keyword-based classification.
- PyTorch: Framework for fine-tuning the BERT model and implementing ML pipelines.
- Streamlit: Provides a simple and interactive web interface for users to view results.
- Tor Browser: Configured with a proxy server for secure access to the dark web.
- Kali Linux: Operating system used to run the Tor browser and scrape .onion links.
- Clone the Repository:

  ```bash
  git clone https://github.com/AJAmit17/DarkSight.git
  cd darksight
  ```

- Set Up Environment:
- Install Python 3.9 or above.
- Create a virtual environment and activate it:

  ```bash
  python -m venv env
  source env/bin/activate   # For Linux/Mac
  env\Scripts\activate      # For Windows
  ```
- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set Up Tor Proxy:
- Install the Tor browser on Kali Linux and configure it as a proxy server.
- Update the proxy settings in the project to point to your Tor configuration.
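To confirm the proxy settings actually route traffic through Tor before scraping, you can query the Tor Project's check endpoint through the local SOCKS proxy. This is a hedged sketch: the port assumes Tor's default `SocksPort 9050`, and the function name is illustrative:

```python
import requests

# Assumes Tor's SOCKS proxy on localhost:9050; adjust if your torrc differs.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def is_tor_active() -> bool:
    """Return True if requests through the proxy are recognized as Tor exits."""
    resp = requests.get(
        "https://check.torproject.org/api/ip",
        proxies=proxies,
        timeout=30,
    )
    return bool(resp.json().get("IsTor", False))
```

Run this once after starting Tor; if it returns `False` (or times out), the scraper would be leaking requests outside Tor and the proxy settings need fixing.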
- Keyword Classification:
  - Run the trained BERT model on the scraped keywords:

    ```bash
    python classify_text.py
    ```
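A minimal sketch of what this classification step looks like with the Hugging Face `transformers` API. The checkpoint directory, function names, and the label mapping (0 = legal, 1 = illicit) are assumptions about how `classify_text.py` is organized, not confirmed details:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def load_classifier(model_dir: str):
    """Load the fine-tuned BERT checkpoint (directory path is a placeholder)."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return tokenizer, model

def classify_keywords(tokenizer, model, keywords: list) -> str:
    """Join the scraped keywords into one text and predict a label."""
    inputs = tokenizer(
        " ".join(keywords),
        truncation=True,
        max_length=512,   # BERT's maximum sequence length
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes label id 1 meant "illicit" during fine-tuning.
    return "illicit" if logits.argmax(dim=-1).item() == 1 else "legal"
```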
- Launch the Streamlit App:
  - View the results in an interactive frontend:

    ```bash
    streamlit run app.py
    ```
- The BERT model is fine-tuned using a dataset of illicit and legal keywords.
- Training scripts are available in the main directory.
- To re-train the model, provide a labeled dataset and run:

  ```bash
  python train_model.py
  ```
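The expected dataset layout is not documented here, so the following loader sketches one plausible format: a CSV with a `keywords` column (space-separated keywords per page) and a `label` column (`illicit` or `legal`). Both column names and the label-to-id mapping are assumptions:

```python
import pandas as pd

# Assumed mapping; must match whatever ids were used when fine-tuning.
LABEL2ID = {"legal": 0, "illicit": 1}

def load_training_data(path: str) -> pd.DataFrame:
    """Load a labeled CSV and add integer label ids for training."""
    df = pd.read_csv(path)
    df["label_id"] = df["label"].map(LABEL2ID)
    return df
```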
- Scraper: Uses the Tor browser to securely fetch .onion pages.
- Keyword Analysis: Extracts keywords from the fetched pages and processes them for classification.
- Classification Model: Fine-tuned BERT model predicts whether the website content is illicit or legal.
- Frontend: Streamlit application displays the results in an easy-to-use interface.
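The keyword-analysis stage described above can be sketched with only the standard library: strip HTML tags, tokenize, drop short and stop words, and keep the most frequent terms. This is a simplified stand-in; the project's actual extraction logic may differ:

```python
import re
from collections import Counter
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect all text nodes from an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}

def extract_keywords(html: str, top_n: int = 20) -> list:
    """Return the top_n most frequent words from a page's visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]{3,}", text)  # words of 3+ letters
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]
```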
- Enhance the keyword dataset for improved classification accuracy.
- Add real-time monitoring and alert notifications.
- Incorporate additional ML models for better context understanding.
This project is licensed under the MIT License. See the LICENSE file for details.