Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request enhances the observability and analysis capabilities for context usage.
Code Review
This pull request introduces a suite of tools for analyzing context usage from logs, including a new log line for capturing context length statistics and new scripts: context_log.py for parsing prompt logs, context_stat.py for statistical analysis and visualization, and context_stat_tool.py as a command-line interface.

While these tools are useful, they introduce significant security vulnerabilities. There is a Log Injection vulnerability in aworld/logs/prompt_log.py, where untrusted user input is logged using specific separators (| and │) without sanitization, potentially allowing attackers to spoof log entries. Additionally, the new CLI tools in aworld/logs/tools/ are vulnerable to Path Traversal: they accept unvalidated file paths and output directories from command-line arguments, which could lead to arbitrary file writes.

Beyond security, a critical issue was identified in the log parsing logic of context_log.py that could lead to incorrect data processing when handling large files. Several medium-severity issues also exist, related to code cleanliness and maintainability (e.g., unused imports and inconsistent comments) and the usability of the example script. Addressing these security and functional concerns is crucial before merging.
```python
def read_file_chunks(self) -> List[List[str]]:
    """
    Read the file and split it into chunks.

    Returns:
        List of file content chunks.
    """
    logger.info(f"🚀 Start reading log file: {self.log_file_path}")

    try:
        with open(self.log_file_path, 'r', encoding='utf-8') as file:
            lines = []
            chunks = []

            for line in file:
                lines.append(line.strip())

                # When the chunk size is reached, flush the current chunk
                if len(lines) >= self.chunk_size:
                    chunks.append(lines.copy())
                    lines.clear()
                    logger.debug(f"📦 Processed file chunk, chunk count: {len(chunks)}")

            # Handle the final chunk
            if lines:
                chunks.append(lines)
                logger.debug(f"📦 Processed final file chunk, total chunks: {len(chunks)}")

            logger.info(f"✅ File read complete, {len(chunks)} chunks in total")
            return chunks

    except Exception as e:
        logger.error(f"❌ Failed to read file: {e}")
        raise
```
The current implementation of read_file_chunks splits the log file into fixed-size chunks of lines. This can cause a single log record (which can span multiple lines) to be split across two different chunks. When this happens, parse_chunk and parse_context_record will fail to parse the record correctly, potentially leading to missed records or incomplete data. This is a critical issue for a log parser.
A more robust approach would be to make the chunking logic aware of record boundaries. For example, you could read a block of lines, find the last record start marker (🚀 AGENT EXECUTION START), and process the chunk up to that point. The remaining lines would then be prepended to the next block of lines read from the file, ensuring that parse_context_record always receives complete records. However, this makes parallel processing with ThreadPoolExecutor more complex, since it introduces a sequential dependency between chunks.
An alternative for parallel processing is to have each worker seek to a calculated offset in the file, then scan backwards to the first record separator to find a clean starting point. This avoids splitting records.
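The first (sequential) approach can be sketched as follows. This is an illustrative rewrite, not the PR's code: the function name `read_record_aligned_chunks` is hypothetical, and the record-start marker is assumed from the discussion above.

```python
from typing import Iterator, List

RECORD_START = "🚀 AGENT EXECUTION START"  # assumed record-start marker

def read_record_aligned_chunks(path: str, chunk_size: int = 1000) -> Iterator[List[str]]:
    """Yield line chunks that never split a record across two chunks."""
    lines: List[str] = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            lines.append(line.rstrip("\n"))
            if len(lines) >= chunk_size:
                # Emit only up to the last record start; carry the remainder
                # forward so the record beginning there stays whole.
                last = max((i for i, l in enumerate(lines) if RECORD_START in l),
                           default=0)
                if last > 0:
                    yield lines[:last]
                    lines = lines[last:]
    if lines:
        yield lines
```

Each yielded chunk then starts at a record boundary, so it can still be handed to a worker pool for parsing, at the cost of the sequential read loop.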
```python
digest_logger.info(
    f"context_length|{agent.id()}|{context.task_id}|{getattr(context, 'user', 'default')}|{context.session_id}|{model_name}|{total_context_length}|{json.dumps(token_breakdown)}|{len(messages)}")
```
The digest_logger writes structured data using the | separator. However, fields like agent.id(), context.task_id, and model_name are not sanitized for this separator. If any of these fields contain a | character, it will corrupt the log entry and cause the context_stat.py parser to misalign fields or fail. Additionally, if these fields contain newlines, an attacker could inject entirely new log entries.
```python
if not output_path.suffix.lower() == '.txt':
    output_path = output_path.with_suffix('.txt')

with open(output_path, 'w', encoding='utf-8') as f:
```
The save_record_to_file function uses the output_file path provided via command-line arguments without validation. This allows arbitrary file writes via path traversal (e.g., using ../../ in the -o or --output-file argument). An attacker running this tool could overwrite sensitive files the user has access to.
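One way to harden this is to resolve the user-supplied path against a fixed base directory and reject anything that escapes it. A minimal sketch, assuming a base directory of "output" (the function name and default are illustrative, not the tool's actual interface):

```python
from pathlib import Path

def resolve_output_path(user_path: str, base_dir: str = "output") -> Path:
    """Resolve a user-supplied output path, rejecting escapes from base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # A safe candidate must be the base itself or live somewhere under it.
    if base not in candidate.parents and candidate != base:
        raise ValueError(f"output path escapes {base}: {user_path!r}")
    return candidate
```

Because `resolve()` normalizes `..` components before the containment check, both relative traversal (`../../etc/passwd`) and absolute paths outside the base are rejected.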
```python
# Save or show plot
if save_path:
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
```
```python
out_dir = os.path.abspath(args.output_dir)
if out_dir != "." and not os.path.isdir(out_dir):
    os.makedirs(out_dir, exist_ok=True)
```
```python
prompt_logger.info(f"│ {line}")

prompt_logger.info(_generate_separator())
if messages and len(messages) > 0 and messages[-1].get('role') != 'assistant':
```
The condition messages and len(messages) > 0 is redundant. In Python, a non-empty list evaluates to True in a boolean context, so you can simplify this to just if messages and ... for better readability.
```diff
- if messages and len(messages) > 0 and messages[-1].get('role') != 'assistant':
+ if messages and messages[-1].get('role') != 'assistant':
```
| """ | ||
| 日志解析器 - 从amnicontext_prompt.log中根据Context ID获取最后一条task_id记录 | ||
| 支持并发处理和批次优化 | ||
| """ |
```python
import pandas as pd
import numpy as np
```
```python
import glob
import os
```
```python
from aworld.logs.tools.context_stat_tool import stat_log

log_file = "yourpath/logs/digest_logger.log"
```
The log_file is hardcoded to a placeholder path "yourpath/logs/digest_logger.log", which will cause the script to fail if not edited by the user. For a better user experience, consider using a more robust way to determine the log file path. For example, you could use a relative path to an example log file within the repository, or accept a command-line argument.
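The command-line-argument option can be sketched with argparse. The argument names below are assumptions for illustration, not the tool's actual interface:

```python
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    """Parse CLI arguments for the example script (illustrative interface)."""
    parser = argparse.ArgumentParser(
        description="Analyze context usage from a digest log.")
    parser.add_argument("log_file",
                        help="path to the digest log, e.g. logs/digest_logger.log")
    parser.add_argument("-o", "--output-dir", default=".",
                        help="directory for generated reports")
    return parser.parse_args(argv)
```

The script would then call `stat_log(args.log_file, ...)` instead of relying on a hardcoded placeholder path.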