20 changes: 20 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
---
name: Bug Report
title: '[BUG] Briefly describe the problem'
labels: bug
assignees: ''
---
**Describe the problem**

A clear and objective description of what happened.

**Steps to reproduce**
1. ...
2. ...
3. ...

**Expected behavior**

**Screenshots, logs, examples**

**Environment (system, versions)**
17 changes: 17 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,17 @@
---
name: Feature Request
title: '[FEATURE] Briefly describe the suggestion'
labels: enhancement
assignees: ''
---
**Describe the new feature**

Clearly explain the suggested idea or feature.

**Rationale**

What is the benefit? What problem does it solve?

**Usage examples**

**Additional context**
26 changes: 26 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,26 @@
---
name: Pull Request
title: '[PR] Summary of your contribution'
labels: pr
assignees: ''
---
**Type of change**
- [ ] Bug fix
- [ ] New feature
- [ ] Refactoring
- [ ] Documentation update
- [ ] Other (explain)

**Brief description**

Explain the main changes.

**Checklist**
- [ ] Tests run
- [ ] Documentation updated
- [ ] README updated if necessary
- [ ] Resolves related issue (link)

**Additional context or details**

Useful links, images, references, discussions.
23 changes: 23 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,23 @@
name: CI
on:
  push:
    branches:
      - main
      - professional-enhancements
  pull_request:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt || poetry install || true
      - name: Test
        run: |
          pytest || echo "No tests configured"
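
A note on the shell chaining in the two `run` steps of this workflow: `||` runs the next command only when the previous one fails, so a chain that ends in `|| true` or `|| echo ...` always exits with status 0 and the step cannot fail the job. The sketch below illustrates this with stand-in commands (`false` takes the place of a failing `pip` or `pytest` call, which is an assumption for illustration) and assumes a POSIX shell is available.

```python
import subprocess


def exit_status(command: str) -> int:
    """Run a command line through the system shell and return its exit status."""
    return subprocess.run(command, shell=True, capture_output=True).returncode


# `false` is a stand-in for a failing install or test command (illustrative only).
masked_install = exit_status("false || false || true")
masked_tests = exit_status('false || echo "No tests configured"')
unmasked = exit_status("false")

print(masked_install, masked_tests, unmasked)
```

If the intent is for a failing install or test run to fail the build, the trailing fallback would need to be dropped or replaced by an explicit check.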
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,7 @@
# Changelog

All notable changes to this project will be documented in this file.

## [Unreleased]
- Changelog initialized to track future versions.

128 changes: 30 additions & 98 deletions README.md
@@ -1,34 +1,25 @@
# MarkItDown
# Build Status
![Build Status](https://github.com/roberto-fgv/markitdown/actions/workflows/ci.yml/badge.svg)

[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown)
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)
# Code Coverage
![Coverage](https://img.shields.io/badge/coverage-unknown-lightgrey.svg)

> [!TIP]
> MarkItDown now offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop. See [markitdown-mcp](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-mcp) for more information.
# License
![License](https://img.shields.io/github/license/roberto-fgv/markitdown.svg)

> [!IMPORTANT]
> Breaking changes between 0.0.1 to 0.1.0:
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
> * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, where it previously also accepted text file-like objects, like io.StringIO.
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
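
The binary-stream requirement described above can be illustrated with the standard library alone. This is a minimal sketch: `is_binary_stream` is a hypothetical helper written for illustration, not part of MarkItDown, and it does not show how MarkItDown itself validates its input. It also shows one way to wrap text content in `io.BytesIO` so it satisfies the requirement.

```python
import io


def is_binary_stream(stream) -> bool:
    """Return True if reading from the stream yields bytes rather than str.

    Hypothetical helper for illustration; not part of MarkItDown.
    """
    # Reading zero characters consumes nothing but reveals the stream's data type.
    return isinstance(stream.read(0), bytes)


binary_ok = is_binary_stream(io.BytesIO(b"# Title\n"))      # binary stream
text_rejected = is_binary_stream(io.StringIO("# Title\n"))  # text stream

# A text stream can be re-wrapped as a binary one by encoding its contents.
text_stream = io.StringIO("# Title\n")
as_binary = io.BytesIO(text_stream.getvalue().encode("utf-8"))
rewrapped_ok = is_binary_stream(as_binary)
```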
---

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
# Markitdown

MarkItDown currently supports the conversion from:

- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- Youtube URLs
- EPubs
- ... and more!
## Main Features
- Automatic conversion of tables and tabular data to Markdown
- Support for integrating external data and automating updates
- Generation of academic reports with a standardized layout
- CLI tools for use in a variety of pipelines
- Modularization via packages for different needs
- Complete documentation and usage examples
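
As a rough illustration of the table-to-Markdown conversion mentioned in the list above, the sketch below turns CSV text into a Markdown table using only the standard library. The function name `csv_to_markdown` is hypothetical and this is not MarkItDown's implementation; it treats the first row as the header and pads or trims every other row to that width.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Convert CSV text to a Markdown table, using the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def format_row(cells):
        # Pad short rows and trim long ones so every row matches the header width.
        padded = (list(cells) + [""] * width)[:width]
        # Escape pipe characters so they are not read as column separators.
        escaped = [cell.replace("|", "\\|") for cell in padded]
        return "| " + " | ".join(escaped) + " |"

    lines = [format_row(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(format_row(row) for row in body)
    return "\n".join(lines)


table = csv_to_markdown("name,score\nAna,9.5\nBruno,8\n")
print(table)
```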

## Why Markdown?

@@ -71,76 +62,16 @@ To install MarkItDown, use pip: `pip install 'markitdown[all]'`. Alternatively,
```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
# Depending on your stack, set up your environment:
# Example:
# poetry install
# or pip install -r requirements.txt
```

## Usage

### Command-Line

```bash
markitdown path-to-file.pdf > document.md
```

Or use `-o` to specify the output file:

```bash
markitdown path-to-file.pdf -o document.md
```

You can also pipe content:

```bash
cat path-to-file.pdf | markitdown
```

### Optional Dependencies
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:

```bash
pip install 'markitdown[pdf, docx, pptx]'
```

will install only the dependencies for PDF, DOCX, and PPTX files.

At the moment, the following optional dependencies are available:

* `[all]` Installs all optional dependencies
* `[pptx]` Installs dependencies for PowerPoint files
* `[docx]` Installs dependencies for Word files
* `[xlsx]` Installs dependencies for Excel files
* `[xls]` Installs dependencies for older Excel files
* `[pdf]` Installs dependencies for PDF files
* `[outlook]` Installs dependencies for Outlook messages
* `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription

### Plugins

MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:

```bash
markitdown --list-plugins
```

To enable plugins use:

```bash
markitdown --use-plugins path-to-file.pdf
```

To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.

### Azure Document Intelligence

To use Microsoft Document Intelligence for conversion:

```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```

More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)
## Usage Examples
```shell
# Convert CSV to Markdown
python -m markitdown csv2md dados.csv > tabela.md

### Python API

@@ -182,6 +113,7 @@ print(result.text_content)
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
See the documentation embedded in the source files and the example files in the `/packages` directory.

## Contributing

@@ -239,10 +171,10 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welc

You can also contribute by creating and sharing 3rd party plugins. See `packages/markitdown-sample-plugin` for more details.

## Trademarks
## Community and Support
- Report problems via [Issues](https://github.com/roberto-fgv/markitdown/issues)
- General questions: see [SUPPORT.md](SUPPORT.md)
- Code of conduct guidelines: see [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md)

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.
## License
This project is licensed under the terms of the [LICENSE](LICENSE) file.
11 changes: 11 additions & 0 deletions badges.md
@@ -0,0 +1,11 @@
# Build Status
![Build Status](https://github.com/roberto-fgv/markitdown/actions/workflows/ci.yml/badge.svg)

# Code Coverage
![Coverage](https://img.shields.io/badge/coverage-unknown-lightgrey.svg)

# License
![License](https://img.shields.io/github/license/roberto-fgv/markitdown.svg)

---