GoByte is a fast PCAP/PCAPNG parser built in Go for preprocessing network traffic data for deep learning models. It extracts packet bytes and exports them in multiple formats (CSV, Parquet, or NumPy).
________ __________ __
/ _____/ ____\______ \___.__._/ |_ ____
/ \ ___ / _ \| | _< | |\ __\/ __ \
\ \_\ ( <_> ) | \\___ | | | \ ___/
\______ /\____/|______ // ____| |__| \___ >
\/ \/ \/ \/
Fast PCAP Parser for Deep Learning | Network Traffic Preprocessing
- High Performance: Concurrent packet processing using Go's goroutines
- Memory Efficient: Streaming mode for processing large datasets with minimal RAM usage
- Multiple Formats: Export to CSV, Parquet, or NumPy (binary format optimized for ML/DL)
- Class Labels: Automatic labeling from directory structure (e.g.,
dataset/malware/*.pcap) - Batch Processing: Process multiple PCAP files in parallel
- Flexible Output: Fixed-length padding/truncation or variable-length packets
- Privacy Protection: Optional IP address masking for anonymization
- Deep Learning Ready: Direct output for PyTorch, TensorFlow, scikit-learn
- PCAP/PCAPNG Support: Handles both formats automatically
- Network Traffic Classification: Application identification, IoT device classification
- Intrusion Detection Systems: Anomaly detection, signature-based detection
- Network Behavior Analysis: Protocol analysis, traffic profiling
- Deep Learning Preprocessing: CNN, RNN, Transformer models for packet analysis
Download the latest release for your platform:
- Linux (AMD64): gobyte-linux-amd64
- Linux (ARM64): gobyte-linux-arm64
- macOS (Intel): gobyte-darwin-amd64
- macOS (Apple Silicon): gobyte-darwin-arm64
- Windows (AMD64): gobyte-windows-amd64.exe
# Download the binary (replace with your machine platform)
wget https://github.com/afifhaziq/GoByte/releases/latest/download/gobyte-linux-amd64
# Make it executable
chmod +x gobyte-linux-amd64
# Move to PATH (optional)
sudo mv gobyte-linux-amd64 /usr/local/bin/gobyte
# Verify installation
gobyte --help- Download
gobyte-windows-amd64.exefrom Releases - Rename to
gobyte.exe - Add to your PATH or run from the download directory
Requirements:
- Go 1.23 or higher
- libpcap development headers
# Ubuntu/Debian
sudo apt-get install libpcap-dev
# Fedora/CentOS/RHEL
sudo yum install libpcap-devel
# Arch Linux
sudo pacman -S libpcap# libpcap is pre-installed on macOS
# If needed, install via Homebrew:
brew install libpcap# Clone the repository
git clone https://github.com/afifhaziq/GoByte.git
cd GoByte
# Download Go dependencies
go mod download
# Build the binary
go build -o gobyte .
# Run
./gobyte --helpIf you're new to network traffic analysis and don't have PCAP files to start with, download MalayaNetwork_GT, our ready-to-use dataset.
The easiest way to install this dataset is to use the uv package manager (No virtual environment required). Click this link to install it .
# Install uv (if not already installed)
# Download the MalayaNetwork_GT dataset (13.5 GB, 10 traffic classes)
uvx hf download Afifhaziq/MalayaNetwork_GT \
--repo-type dataset \
--local-dir ./dataset \
--include "PCAP/*"
# Process the entire dataset with GoByte (streaming mode for large datasets)
gobyte --dataset dataset/PCAP --format numpy --length 1500 --streaming --output dataset.npy
# Or use parquet format:
# gobyte --dataset dataset/PCAP --format parquet --length 1500 --streamingThis will download network traffic from 10 application classes:
- Bittorent
- ChromeRDP
- Discord
- EA Origin
- Microsoft Teams
- Slack
- Steam
- Teamviewer
- Webex
- Zoom
Usage: gobyte [options]
Options:
--input string
Input PCAP file path (single file mode)
--dataset string
Dataset directory with class subdirectories (multi-file mode)
--format string
Output format: csv, parquet, or numpy (default "csv")
--output string
Output file path (default: output.csv, output.parquet, or output.npy based on format)
--length int
Desired length of output bytes (pad/truncate). 0 = keep original size (default: 0)
--sort
Retain packets order. Set to false to shuffle (default: true)
--concurrent int
Max concurrent files to process (multi-file mode) (default: 2)
--streaming
Use streaming mode for memory efficiency (default: true)
--per-file
Create separate output file for each input file (dataset mode only)
--ipmask
Mask source and destination IP addresses
Memory Optimization:
--streaming Stream packets to disk (default: true, ~200-300MB RAM)
--streaming=false Load all packets in memory (WARNING: can cause OOM for large datasets)
--per-file Create one output per input file (lowest memory, parallel)
Note: Streaming mode is enabled by default to prevent OOM errors.
Use --streaming=false for in-memory processing (only recommended for small files).
Process a single PCAP file and export to CSV:
gobyte --input traffic.pcap --output results.csvProcess and pad/truncate packets to fixed length:
gobyte --input traffic.pcap --length 1480 --format numpyMask IP addresses for privacy:
gobyte --input traffic.pcap --ipmask --format numpyOrganize your dataset like this:
dataset/
├── benign/
│ ├── file1.pcap
│ └── file2.pcap
├── malware/
│ ├── file1.pcap
│ └── file2.pcap
└── ddos/
├── file1.pcap
└── file2.pcap
Process all files and automatically assign class labels:
gobyte --dataset dataset --format numpy --length 720For large datasets, use streaming mode:
gobyte --dataset dataset --format numpy --length 720 --streamingNote: Labels are automatically extracted from directory names. You may need to encode them numerically before training except for numpy format.
Example 1: Basic CSV Export
gobyte --input data.pcap
# Output: output/output.csvExample 2: Fixed-Length Packets (Recommended for Deep Learning)
gobyte --input data.pcap --length 1500 --format parquet
# All packets padded/truncated to exactly 1500 bytes
# Parquet works best with fixed-length packetsExample 3: Multi-File Dataset with Labels
gobyte --dataset my_dataset --format parquet --concurrent 4
# Processes multiple files in parallel and assigns labels from directory namesExample 4: Large Dataset with Streaming Mode
gobyte --dataset my_dataset --format parquet --length 1500 --streaming
# Memory-efficient processing for large datasets (uses ~200-300MB RAM)Example 5: Per-File Output Mode
gobyte --dataset my_dataset --format parquet --per-file
# Creates separate output file for each input file (maximum memory efficiency)Example 6: Variable-Length Packets
gobyte --input data.pcap --length 0 --format csv
# Keeps original packet sizes (no padding/truncation)
# Note: Use CSV for variable-length packets (Parquet is slow for variable-length)Example 7: IP Address Masking for Privacy
gobyte --input data.pcap --ipmask --format parquet
# Masks source and destination IP addresses (sets them to 0.0.0.0)Example 8: NumPy Format (Recommended for ML/DL)
gobyte --dataset my_dataset --format numpy --length 1500 --streaming --output dataset.npy
# Exports to NumPy format: 3-4x smaller than CSV, 10-20x faster loading
# Output: dataset_data.npy, dataset_labels.npy, dataset_classes.jsonSee example/README.md for detailed NumPy usage examples and DL framework integration.
Example 9: Combined Options
gobyte --dataset my_dataset --length 720 --ipmask --streaming --format parquet
# Fixed-length packets with IP masking in memory-efficient streaming mode- Human-readable text format
- Large file sizes
- Compatible with all data analysis tools
- Recommended for variable-length packets (
--length 0) - Fast and memory-efficient for all packet sizes
- Compressed columnar format
- 10-20x smaller than CSV
- Optimized for ML frameworks (PyTorch, TensorFlow)
- Best with
--lengthflag (e.g.,--length 1500) - Variable-length Parquet is slow and memory-intensive - use CSV instead for variable-length data
- Binary format (
.npyfiles) - 3-4x smaller than CSV (18 GB vs 50-70 GB for same dataset)
- 10-20x faster loading (2-5 seconds vs 30-60 seconds)
- Native ML/DL integration - zero-copy with PyTorch, TensorFlow, JAX
- Memory-efficient streaming mode (~200-300 MB RAM)
- Outputs:
*_data.npy(packet data),*_labels.npy(class labels),*_classes.json(mapping)
For detailed NumPy usage, examples, and ML framework integration, see example/README.md.
Dataset Specifications:
- Total Files: 31 PCAP files
- Total Size: 13 GB
- Total Packets: 12,584,314 packets
- Classes: 10 application types (Bittorent, ChromeRDP, Discord, EA Origin, Microsoft Teams, Slack, Steam, Teamviewer, Webex, Zoom)
Processing the entire MalayaNetwork_GT dataset (12.5M packets) with NumPy format in streaming mode:
Note: These results are based on the end to end time of preprocessing + exporting time of the dataset. If you prefer faster processing time, you can fork and remove the exporting process in writer.go.
| Metric | Result |
|---|---|
| Processing Time | 16.2 seconds |
| Throughput | ~776,000 packets/second |
| Peak Memory Usage | 55 MB |
| Average Memory Usage | 50 MB |
| Output Size | 18 GB (data.npy) + 13 MB (labels.npy) |
| Size Reduction | 3-4x smaller than CSV equivalent |
Key Achievements:
- Processed 12.5 million packets in just 16 seconds
- Used only 55 MB peak RAM (lightweight and memory-efficient)
- Generated 18 GB NumPy output (vs ~50-70 GB for CSV)
- Maintained ~776K packets/second throughput throughout
Processing a single 24 MB PCAP file (30,885 packets):
| Format | Output Size | Processing Time | Peak RAM |
|---|---|---|---|
| CSV (variable-length) | 80 MB | 675 ms | 39 MB |
| Parquet (variable-length) | 23 MB | 163 ms | 81 MB |
| CSV (fixed 1500 bytes) | 123 MB | 915 ms | 39 MB |
| Mode | Peak RAM | Use Case |
|---|---|---|
| Streaming Mode | 50-55 MB | Large datasets (recommended) |
| Per-File Mode | 68-92 MB | Maximum parallelism |
| Batch Mode | Variable | Small datasets only |
View the full benchmark results and test suite: benchmark_results.log
Fixed-Length (--length 1500):
Byte_0, Byte_1, Byte_2, ..., Byte_1499, Class
69, 0, 0, ..., 0, benign
69, 0, 0, ..., 0, malware
Variable-Length (--length 0):
Byte_0, Byte_1, Byte_2, ..., Class
69, 0, 0, benign
69, 0, 0, 52, malware
GoByte/
├── main.go # CLI entry point and orchestration
├── parser.go # PCAP parsing and concurrent processing
├── writer.go # CSV/Parquet/NumPy export with streaming support
├── packet_utils.go # Packet processing utilities
├── go.mod # Go module definition
├── go.sum # Dependency checksums
├── example/ # NumPy usage examples and utilities
├── dataset/ # Sample PCAP files (not committed)
├── output/ # Generated output files (not committed)
└── .github/
└── workflows/
└── release.yml # Automated binary builds
# github.com/google/gopacket/pcap
../go/pkg/mod/github.com/google/[email protected]/pcap/pcap_unix.go:34:10: fatal error: pcap.h: No such file or directory
34 | #include <pcap.h>
| ^~~~~~~~
compilation terminated.Solution: Install the libpcap development headers. Follow the instructions in the Installation section.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/feature1) - Run tests and formatting (
go test ./... && go fmt ./...) - Commit your changes (
git commit -m 'Feature: Hello there') - Push to the branch (
git push origin feature/feature1) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.