Incomplete/Missing Content in Retrieval for Android Project - Embedder Configuration Recommendations #375

@CiaShangLin

Current Configuration

I'm currently using the following embedder configuration for a medium-sized Android project:

{
  "embedder": {
    "client_class": "OpenAIClient",
    "batch_size": 500,
    "model_kwargs": {
      "model": "text-embedding-3-small",
      "dimensions": 256,
      "encoding_format": "float"
    }
  },
  "retriever": {
    "top_k": 20
  },
  "text_splitter": {
    "split_by": "word",
    "chunk_size": 350,
    "chunk_overlap": 100
  }
}

LLM Model: GPT-4o
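
For reference, here is a minimal sketch of what these numbers imply, assuming the splitter is a plain sliding window over whitespace-separated words (the real splitter may behave differently, and `split_by_word` and `max_retrieved_words` are hypothetical helper names, not part of any library). With chunk_size 350 and chunk_overlap 100, each new chunk starts 250 words after the previous one, and top_k 20 caps the retrieved context at 20 × 350 = 7000 words per query.

```python
def split_by_word(text: str, chunk_size: int = 350, chunk_overlap: int = 100) -> list[str]:
    """Sliding-window split over whitespace-separated words (assumed behavior)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # 250 with the settings above
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def max_retrieved_words(top_k: int, chunk_size: int) -> int:
    """Upper bound on the number of words the retriever can pass to the LLM."""
    return top_k * chunk_size


if __name__ == "__main__":
    # A synthetic 1000-word "file" stands in for a source file.
    sample = " ".join(f"w{i}" for i in range(1000))
    chunks = split_by_word(sample)
    print(len(chunks), "chunks; budget:", max_retrieved_words(20, 350), "words")
```

Under this assumption a window boundary can fall in the middle of a class or method, since the split ignores code structure entirely.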

Problem Description

When I index my private Android project with this configuration, parsing and retrieval appear to miss significant content. The responses seem incomplete and lack important context from the codebase.

Questions

  1. Embedding Dimensions: Is 256 dimensions too low for code embeddings? Should I increase this for better semantic understanding of Android code?

  2. Chunk Size: Is chunk_size: 350 words appropriate for Android projects (Java/Kotlin files with classes, methods, XML layouts, etc.)? Should this be adjusted?

  3. Retriever Settings: Is top_k: 20 sufficient, or should it be increased for better context coverage?

  4. Model Selection: Would switching to text-embedding-3-large provide better results for code understanding?

  5. Android-Specific Considerations: Are there recommended configurations specifically optimized for Android projects that better handle:

    • Java/Kotlin code
    • XML layouts and resources
    • Gradle build files
    • AndroidManifest.xml
    • Multi-module project structure
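
One way I could compare candidate settings for the questions above is to hand-label which files should be retrieved for a few test queries and measure recall@k before and after each change. The sketch below is only an illustration: `recall_at_k` is a hypothetical helper, and the file paths are made-up placeholders rather than real results from my project.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved items."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    # Hypothetical ranked retriever output for one query (placeholder paths).
    retrieved = [
        "app/src/main/java/LoginActivity.kt",
        "app/src/main/res/layout/activity_login.xml",
        "app/build.gradle",
        "app/src/main/AndroidManifest.xml",
    ]
    # Hand-labeled files that a complete answer would need (placeholder paths).
    relevant = {
        "app/src/main/res/layout/activity_login.xml",
        "app/src/main/AndroidManifest.xml",
        "core/src/main/java/AuthRepository.kt",
    }
    for k in (2, 4):
        print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

Running the same labeled queries against each configuration (different dimensions, chunk_size, top_k, or embedding model) would show whether a change actually recovers the missing files.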

Expected Behavior

The embedder should capture and retrieve comprehensive context from the Android codebase, including:

  • Class implementations and their relationships
  • Method implementations and logic
  • Resource files (layouts, strings, etc.)
  • Build configuration
  • Project architecture and dependencies

Environment

  • Project Type: Medium-sized Android project
  • Primary Languages: Java/Kotlin
  • Project Structure: Multi-module (assumed)
  • Current Model: GPT-4o with OpenAI text-embedding-3-small

Requested Recommendations

What embedder configuration would you recommend for optimal performance with Android projects? Specifically:

  • Optimal dimensions value
  • Recommended chunk_size and chunk_overlap
  • Appropriate top_k value
  • Alternative embedding models if applicable
  • Any Android-specific tuning parameters

Thank you!
