Changes from all commits (20 commits)
89307f0
feat: Implement SharePoint data extraction sources and configuration …
Oct 23, 2025
d9fe95c
Fix: Remove file size limit from SharePoint file configuration and ex…
Oct 23, 2025
dc6e8f5
Refactor: Replace loguru logger with dlt.common logger in SharePoint …
Oct 23, 2025
ad83f57
Refactor: Remove unused methods and imports from SharepointClient class
Oct 23, 2025
c94cb2f
fix: Change log level from success to info for SharePoint connection …
Oct 23, 2025
0724984
refactor: Improve code readability by formatting and organizing funct…
Oct 23, 2025
34776d1
fix: Remove unused attributes from SharepointListConfig and Sharepoin…
Oct 23, 2025
933becd
feat: Add SharePoint source with list and file extraction capabilities
Dec 8, 2025
0a76475
refactor: Update resource extraction in SharepointListSource and Shar…
Dec 8, 2025
8155602
Update helpers
Jan 7, 2026
ffb300a
refactor: Remove unused imports and file type from SharePoint configu…
Jan 7, 2026
08f98dd
refactor: Clean up code formatting and improve readability in SharePo…
Jan 7, 2026
f20b50b
refactor: Enhance type hints for SharePoint configuration and helper …
Jan 7, 2026
a57c0f5
refactor: Add 'Any' type hint to improve type flexibility in SharePoi…
Jan 7, 2026
7c7e514
refactor: Improve type hints and return types in SharePoint source an…
Jan 7, 2026
66df65a
refactor: Simplify list filtering logic in SharepointClient to use al…
Jan 7, 2026
2169274
refactor: Remove unused 'Union' type hint from SharePoint source files
Jan 7, 2026
3c69c68
test: Add assertions for SAS and SPSS file type functions in Sharepoi…
Jan 7, 2026
60dd0ee
fix: Add chunksize parameter to pandas_kwargs for SharePoint files co…
Jan 7, 2026
36a2d32
docs: Update README.md to clarify folder path validation and authenti…
Jan 7, 2026
5 changes: 5 additions & 0 deletions pyproject.toml
Collaborator comment:
The branch needs to be rebased on master, formatted with make format, and checked with the linter using make lint. I know I mentioned this earlier, but it’s still needed, so I wanted to remind again:

- One file is not formatted
- The linter is picking up an error that has already been resolved in master

@@ -93,6 +93,10 @@ scrapy = [
    "scrapy>=2.11.0,<3",
    "twisted==22.10.0",
]
sharepoint = [
    "msal>=1.20.0",
    "pandas>=2.0.0",
]

[tool.uv]
default-groups = [
@@ -113,6 +117,7 @@ default-groups = [
    "airtable",
    "filesystem",
    "scrapy",
    "sharepoint",
]

# [tool.uv.sources]
263 changes: 263 additions & 0 deletions sources/sharepoint/README.md
@@ -0,0 +1,263 @@
# SharePoint Source

This source allows you to extract data from SharePoint lists and files using the Microsoft Graph API.

## Features

- Extract data from SharePoint lists
- Download and process files from SharePoint document libraries
- Support for multiple file formats (CSV, Excel, JSON, Parquet, SAS, SPSS)
- Incremental loading support for files based on modification time
- Flexible file filtering with regex patterns

## Prerequisites

Before using this source, you need:

1. **Azure AD Application Registration** with the following:
- Client ID
- Tenant ID
- Client Secret
- Microsoft Graph API permissions:
- `Sites.Read.All` or `Sites.ReadWrite.All`
- `Files.Read.All` (for file operations)

2. **SharePoint Site ID**: The unique identifier for your SharePoint site

## Configuration

### Credentials

Configure your credentials in `secrets.toml`:

```toml
[sources.sharepoint]
client_id = "your-client-id"
tenant_id = "your-tenant-id"
site_id = "your-site-id"
client_secret = "your-client-secret"
sub_site_id = "" # Optional: for sub-sites
```

### SharePoint List Configuration

```python
from sharepoint.sharepoint_files_config import SharepointListConfig

list_config = SharepointListConfig(
    table_name="my_list_data",
    list_title="My SharePoint List",
    select="Title,Description,Status",  # Optional: specific fields
    is_incremental=False  # Incremental not yet implemented
)
```

### SharePoint Files Configuration

```python
from sharepoint.sharepoint_files_config import SharepointFilesConfig, FileType

files_config = SharepointFilesConfig(
    file_type=FileType.CSV,
    folder_path="Documents/Reports",
    table_name="reports_data",
    file_name_startswith="report_",
    pattern=r".*\.csv$",  # Optional: regex pattern for filtering
    pandas_kwargs={"sep": ","},  # Optional: pandas read options
    is_file_incremental=True  # Enable incremental loading
)
```

## Usage Examples

### Example 1: Load SharePoint List Data

```python
import dlt
from sharepoint import sharepoint_list, SharepointCredentials
from sharepoint.sharepoint_files_config import SharepointListConfig

# Configure credentials
credentials = SharepointCredentials()

# Configure list extraction
list_config = SharepointListConfig(
    table_name="tasks",
    list_title="Project Tasks"
)

# Create and run pipeline
pipeline = dlt.pipeline(
    pipeline_name="sharepoint_list",
    destination="duckdb",
    dataset_name="sharepoint_data"
)

load_info = pipeline.run(
    sharepoint_list(
        sharepoint_list_config=list_config,
        credentials=credentials
    )
)
print(load_info)
```

### Example 2: Load Files from SharePoint

```python
import dlt
from sharepoint import sharepoint_files, SharepointCredentials
from sharepoint.sharepoint_files_config import SharepointFilesConfig, FileType

# Configure credentials
credentials = SharepointCredentials()

# Configure file extraction
files_config = SharepointFilesConfig(
    file_type=FileType.CSV,
    folder_path="Shared Documents/Reports",
    table_name="monthly_reports",
    file_name_startswith="report_",
    pattern=r"202[4-5].*\.csv$",
    is_file_incremental=True,
    pandas_kwargs={
        "sep": ",",
        "encoding": "utf-8",
        "chunksize": 1000,  # Process in chunks of 1000 rows
    }
)

# Create and run pipeline
pipeline = dlt.pipeline(
    pipeline_name="sharepoint_files",
    destination="duckdb",
    dataset_name="sharepoint_data"
)

load_info = pipeline.run(
    sharepoint_files(
        sharepoint_files_config=files_config,
        credentials=credentials
    )
)
print(load_info)
```

### Example 3: Process Excel Files

```python
files_config = SharepointFilesConfig(
    file_type=FileType.EXCEL,
    folder_path="Reports/Annual",
    table_name="large_report",
    file_name_startswith="annual_",
    pandas_kwargs={
        "sheet_name": "Data",
    }
)
```

## Supported File Types

The source supports the following file types via pandas:

- `FileType.CSV` - CSV files
- `FileType.EXCEL` - Excel files (.xlsx, .xls)
- `FileType.JSON` - JSON files
- `FileType.PARQUET` - Parquet files
- `FileType.SAS` - SAS files
- `FileType.SPSS` - SPSS files

## Incremental Loading

### File Incremental Loading

When `is_file_incremental=True`, the source tracks the `lastModifiedDateTime` of files and only processes files that have been modified since the last run.

```python
files_config = SharepointFilesConfig(
    file_type=FileType.CSV,
    folder_path="Documents",
    table_name="data",
    file_name_startswith="data_",
    is_file_incremental=True  # Only process new/modified files
)
```
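
Conceptually, this is high-water-mark filtering: keep the newest `lastModifiedDateTime` seen so far and skip anything at or below it. A simplified, source-agnostic sketch (not the source's actual implementation; Graph API drive items expose `lastModifiedDateTime` as an ISO-8601 string, which compares correctly as text when timestamps share a timezone):

```python
# Simplified illustration of file-level incremental loading. Field names
# mirror the Graph API driveItem shape; function names are hypothetical.
def newer_files(files: list, last_seen: str) -> list:
    """Return files modified strictly after the stored high-water mark."""
    return [f for f in files if f["lastModifiedDateTime"] > last_seen]

def next_high_water_mark(files: list, last_seen: str) -> str:
    """Advance the mark to the newest timestamp seen in this run."""
    return max([last_seen] + [f["lastModifiedDateTime"] for f in files])
```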

### List Incremental Loading

Incremental loading for SharePoint lists is not yet implemented.

## Advanced Configuration

### Folder Path Validation

Folder paths are automatically normalized:

- Leading/trailing slashes are removed
- Double slashes are not allowed
- Only alphanumeric characters, dashes, underscores, spaces, and dots are allowed
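
These rules can be sketched as a small validation helper (the function name is hypothetical; the source's actual normalization code may differ):

```python
import re

def normalize_folder_path(path: str) -> str:
    """Illustrative sketch of the normalization rules described above."""
    path = path.strip("/")  # drop leading/trailing slashes
    if "//" in path:
        raise ValueError("Double slashes are not allowed in folder paths")
    # Each segment: alphanumerics, dashes, underscores, spaces, and dots
    for segment in path.split("/"):
        if not re.fullmatch(r"[A-Za-z0-9 ._-]+", segment):
            raise ValueError(f"Invalid characters in path segment: {segment!r}")
    return path
```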

### Pattern Matching

The `pattern` parameter is automatically prefixed with `file_name_startswith`. For example:

```python
files_config = SharepointFilesConfig(
    file_name_startswith="report_",
    pattern=r"\d{8}\.csv$"
)
# Effective pattern: ^report_\d{8}\.csv$
```
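
The composition can be sketched as follows (the helper name is hypothetical; the source assembles the pattern internally):

```python
import re

def effective_pattern(file_name_startswith: str, pattern: str) -> str:
    # Anchor at the start of the file name, require the literal prefix,
    # then apply the user-supplied pattern.
    return "^" + re.escape(file_name_startswith) + pattern

# ^report_\d{8}\.csv$ matches "report_20240131.csv"
# but not "summary_20240131.csv"
regex = re.compile(effective_pattern("report_", r"\d{8}\.csv$"))
```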

### Pandas Kwargs

Any pandas read function parameters can be passed via `pandas_kwargs`:

```python
files_config = SharepointFilesConfig(
    file_type=FileType.CSV,
    folder_path="Documents",
    table_name="data",
    file_name_startswith="data",
    pandas_kwargs={
        "sep": ";",
        "encoding": "latin1",
        "decimal": ",",
        "chunksize": 5000
    }
)
```

## Troubleshooting

### Authentication Issues

If you encounter authentication errors:

1. Verify your Client ID, Tenant ID, and Client Secret are correct
2. Ensure your Azure AD app has the required permissions
3. Check that admin consent has been granted for the permissions

### File Not Found

If files are not being found:

1. Verify the folder path is correct (case-sensitive)
2. Check that the file name pattern matches your files
3. Ensure your app has access to the SharePoint site and folder

### Permission Errors

Ensure your Azure AD application has been granted:

- `Sites.Read.All` or `Sites.ReadWrite.All`
- `Files.Read.All`

And that admin consent has been provided for these permissions.

## Resources

- [Microsoft Graph API Documentation](https://learn.microsoft.com/en-us/graph/api/overview)
- [SharePoint REST API](https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/get-to-know-the-sharepoint-rest-service)
- [Azure AD App Registration](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app)