Merged
Changes from 16 commits
41 commits
b4016ef
updates to enable smarter data load
Chenglong-MS Sep 24, 2024
b663a50
Merge branch 'main' into dev
Chenglong-MS Sep 24, 2024
24ba6bd
experimental data cleaning on load function
Chenglong-MS Oct 3, 2024
ff6b585
wip
Chenglong-MS Oct 4, 2024
ea0c562
some fixes
Chenglong-MS Oct 7, 2024
5084e3a
supporting image uploads as inputs
Chenglong-MS Oct 8, 2024
d42a91e
preparing for pip release
Chenglong-MS Oct 9, 2024
2b74c7c
pip install from tar
danmarshall Oct 9, 2024
89af5c7
auto-launch
danmarshall Oct 9, 2024
3ab42e3
remove "f5"
danmarshall Oct 9, 2024
38c1a31
static image for codespace
danmarshall Oct 9, 2024
7b19a06
some clean up
Chenglong-MS Oct 9, 2024
4d9e7e3
update readme
Chenglong-MS Oct 9, 2024
5bb9c9e
cleaning up text
Chenglong-MS Oct 9, 2024
bdfda8a
merge diff
Chenglong-MS Oct 10, 2024
7d142f6
Fix code scanning alert no. 3: DOM text reinterpreted as HTML
Chenglong-MS Oct 10, 2024
5c0e286
Fix code scanning alert no. 6: DOM text reinterpreted as HTML
Chenglong-MS Oct 10, 2024
1ad4bcd
update to readme
Chenglong-MS Oct 10, 2024
a1dcfca
README change
Chenglong-MS Oct 10, 2024
d4d8d2f
add workflow
Chenglong-MS Oct 10, 2024
9e89425
update build
Chenglong-MS Oct 10, 2024
bf83d56
update build script
Chenglong-MS Oct 10, 2024
f85df75
update build script
Chenglong-MS Oct 10, 2024
0e052d8
update build script
Chenglong-MS Oct 10, 2024
c2fd657
fix typo in workflow
Chenglong-MS Oct 10, 2024
f45b5dd
try new install order
Chenglong-MS Oct 10, 2024
f8ce5a0
check
Chenglong-MS Oct 10, 2024
f9bc4a4
try include package information
Chenglong-MS Oct 10, 2024
52329d4
fix
Chenglong-MS Oct 10, 2024
cd55ce0
test again..
Chenglong-MS Oct 10, 2024
1c25fd9
try luck
Chenglong-MS Oct 10, 2024
b16a78a
try luck
Chenglong-MS Oct 10, 2024
363adcb
try luck
Chenglong-MS Oct 10, 2024
0a0d6b8
update pyproject
Chenglong-MS Oct 10, 2024
c729658
update manifest
Chenglong-MS Oct 10, 2024
d75c8dd
update build flow
Chenglong-MS Oct 10, 2024
7b27661
update upload script
Chenglong-MS Oct 11, 2024
fbb965a
try fix build
Chenglong-MS Oct 11, 2024
7fff58b
update publish scripts
Chenglong-MS Oct 11, 2024
2330fa5
prep
Chenglong-MS Oct 11, 2024
e623830
update readme
Chenglong-MS Oct 11, 2024
2 changes: 1 addition & 1 deletion .devcontainer/devcontainer.json
Original file line number Diff line number Diff line change
@@ -16,7 +16,7 @@
// "forwardPorts": [],

// Use 'postCreateCommand' to run commands after the container is created.
- "postCreateCommand": "python3 -m venv /workspaces/data-formulator/venv && . /workspaces/data-formulator/venv/bin/activate && pip install -r /workspaces/data-formulator/requirements.txt --verbose && yarn install && yarn build"
+ "postCreateCommand": "python3 -m venv /workspaces/data-formulator/venv && . /workspaces/data-formulator/venv/bin/activate && pip install https://github.com/user-attachments/files/17319752/data_formulator-0.1.0.tar.gz --verbose && data_formulator"

// Configure tool-specific properties.
// "customizations": {},
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@


*openai-keys.env
**/*.ipynb_checkpoints/

.DS_Store
3 changes: 1 addition & 2 deletions CODESPACES.md
@@ -15,12 +15,11 @@ You will need a GitHub account and to be logged in to use Codespaces.
### Step 2: Run the app
The codespace is a VSCode development environment in the cloud. Once the Codespace is created, start Data Formulator with the following steps:

- * Press **F5** to run. Or if you prefer, click the **Run and Debug** tab on the left, and the **Start Debugging** button.
* A toast about port forwarding will appear, click the **Open in Browser** button.
* You will see the Data Formulator app!

<kbd>
- <img width="528" alt="image" src="https://github.com/user-attachments/assets/e62bebda-8daf-4587-94d4-fede48de382b">
+ <img width="528" alt="image" src="https://github.com/user-attachments/assets/cb9e2123-4a42-4926-8b59-5bafb9be25fa">
</kbd>


2 changes: 2 additions & 0 deletions MANIFEST.IN
@@ -0,0 +1,2 @@
include py-src/data_formulator/dist/*
include py-src/data_formulator/dist/assets/*
39 changes: 28 additions & 11 deletions README.md
@@ -6,13 +6,25 @@

[![arxiv](https://img.shields.io/badge/Paper-arXiv:2408.16119-b31b1b.svg)](https://arxiv.org/abs/2408.16119)&ensp;
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)&ensp;
[![YouTube](https://img.shields.io/badge/YouTube-white?logo=youtube&logoColor=%23FF0000)](https://youtu.be/3ndlwt0Wi3c)&ensp;

</div>

Transform data and create rich visualizations iteratively with AI 🪄. Try Data Formulator now in GitHub Codespaces!

[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/microsoft/data-formulator?quickstart=1)

+ ## News 🔥
+
+ - [10-09-2024] Data Formulator python package released!
+   - You can now install Data Formulator using Python and run it locally, easily. [[check it out]](#get-started).
+   - Our Codespace configuration is also updated for fast start up ⚡️. [[try it now!]](https://codespaces.new/microsoft/data-formulator?quickstart=1)
+
+ - [10-09-2024] New experimental feature release:
+   - Loading an image or a messy data snippet into Data Formulator, with AI parsing and cleaning it for you(!).
+
+ - [10-01-2024] Initial release of Data Formulator, check out our [blog](https://www.microsoft.com/en-us/research/blog/data-formulator-exploring-how-ai-can-help-analysts-create-rich-data-visualizations/) and [video](https://youtu.be/3ndlwt0Wi3c)!
+
+
<kbd>
<a target="_blank" rel="noopener noreferrer" href="https://codespaces.new/microsoft/data-formulator?quickstart=1" title="open Data Formulator in GitHub Codespaces"><img src="public/data-formulator-screenshot.png"></a>
@@ -22,27 +34,32 @@ Transform data and create rich visualizations iteratively with AI 🪄. Try Data

**Data Formulator** is an application from Microsoft Research that uses large language models to transform data, expediting the practice of data visualization.

- To create rich visualizations, data analysts often need to iterate back and forth among data processing and chart specification to achieve their goals. To achieve this, analysts need proficiency in data transformation and visualization tools, and they also spend effort managing the iteration history. This can be challenging!
+ Data Formulator is an AI-powered tool for analysts to iteratively create rich visualizations. Unlike most chat-based AI tools where users need to describe everything in natural language, Data Formulator combines *user interface interactions (UI)* and *natural language (NL) inputs* for easier interaction. This blended approach makes it easier for users to describe their chart designs while delegating data transformation to AI.

- Data Formulator is an AI-powered tool for analysts to iteratively create rich visualizations. Unlike most chat-based AI tools where users need to describe everything in natural language, Data Formulator combines user interface interactions (UI) with natural language (NL) inputs. This blended approach makes it easier for users to describe their chart designs while delegating data transformation to AI.
+ ## Get Started

- Check out these cool Data Formulator features that can help you create impressive visualizations!
- * Using the **blended UI and NL inputs** to describe the chart.
- * Utilizing **data threads** to navigate the history and reuse previous results to create new ones instead of starting from scratch every time.
+ Play with Data Formulator with one of the following options:

- ## Get Started
+ - **Option 1: Install via Python PIP**
+
+ Use Python PIP for an easy setup experience, running locally.
+
+ ```
+ >> pip install data_formulator
+ >> data_formulator
+ ```

- Choose one of the following options to set up Data Formulator:
+ Data Formulator will be automatically opened in the browser at [http://localhost:5000](http://localhost:5000).

- - **Option 1: Codespaces**
+ - **Option 2: Codespaces (5 minutes)**

- Use Codespaces for an easy setup experience, as everything is preconfigured to get you up and running quickly. For more details, see [CODESPACES.md](CODESPACES.md).
+ You can also run Data Formulator in Codespaces; everything is pre-configured. For more details, see [CODESPACES.md](CODESPACES.md).

[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/microsoft/data-formulator?quickstart=1)

- - **Option 2: Local Installation**
+ - **Option 3: Working in the developer mode**

- Opt for a local installation if you prefer full control over your development environment and the ability to customize the setup to your specific needs. For detailed instructions, refer to [DEVELOPMENT.md](DEVELOPMENT.md).
+ You can build Data Formulator locally if you prefer full control over your development environment and the ability to customize the setup to your specific needs. For detailed instructions, refer to [DEVELOPMENT.md](DEVELOPMENT.md).


## Using Data Formulator
2 changes: 1 addition & 1 deletion local_server.bat
@@ -2,7 +2,7 @@
:: Licensed under the MIT License.

@echo off
- set FLASK_APP=app.py
+ set FLASK_APP=py-src/data_formulator/app.py
set FLASK_RUN_PORT=5000
set FLASK_RUN_HOST=0.0.0.0
flask run
2 changes: 1 addition & 1 deletion local_server.sh
@@ -1,4 +1,4 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

- env FLASK_APP=app.py FLASK_RUN_PORT=5000 FLASK_RUN_HOST=0.0.0.0 flask run
+ env FLASK_APP=py-src/data_formulator/app.py FLASK_RUN_PORT=5000 FLASK_RUN_HOST=0.0.0.0 flask run
4 changes: 3 additions & 1 deletion package.json
@@ -24,6 +24,7 @@
"react": "^18.2.0",
"react-animate-height": "^3.0.4",
"react-animate-on-change": "^2.2.0",
+ "react-diff-viewer": "^3.1.1",
"react-dnd": "^16.0.1",
"react-dnd-html5-backend": "^16.0.1",
"react-dom": "^18.2.0",
@@ -40,7 +41,8 @@
"vega": "^5.23.0",
"vega-embed": "^6.21.0",
"vega-lite": "^5.5.0",
- "vm-browserify": "^1.1.2"
+ "vm-browserify": "^1.1.2",
+ "validator": "^13.12.0"
},
"scripts": {
"start": "vite",
5 changes: 5 additions & 0 deletions py-src/data_formulator/__init__.py
@@ -0,0 +1,5 @@
from .app import run_app

__all__ = [
    "run_app",
]
4 changes: 4 additions & 0 deletions py-src/data_formulator/__main__.py
@@ -0,0 +1,4 @@
from .app import run_app

if __name__ == "__main__":
    run_app()
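A package-level `__main__.py` like the one above is the file Python executes for `python -m <package>`. The sketch below illustrates that mechanism with the standard-library `runpy` module. It is a minimal illustration under stated assumptions: `demo_pkg` is a hypothetical throwaway package written to a temporary directory, and its `run_app` only returns a marker string, whereas the real `run_app` in `app.py` (not shown in this diff) presumably starts the server.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

PKG = "demo_pkg"  # hypothetical package name; stands in for data_formulator


def write_demo_package(root: Path) -> None:
    """Create a tiny package with the same three-file layout as the diff."""
    pkg = root / PKG
    pkg.mkdir()
    # app.py: stand-in for the real app module; only returns a marker string.
    (pkg / "app.py").write_text("def run_app():\n    return 'app started'\n")
    # __init__.py re-exports run_app so callers can `from demo_pkg import run_app`.
    (pkg / "__init__.py").write_text(
        "from .app import run_app\n\n__all__ = ['run_app']\n"
    )
    # __main__.py is the file Python executes for `python -m demo_pkg`.
    (pkg / "__main__.py").write_text(
        "from .app import run_app\n\n"
        "if __name__ == '__main__':\n"
        "    RESULT = run_app()\n"
    )


def run_package_as_main() -> dict:
    """Execute the demo package roughly the way `python -m demo_pkg` would."""
    with tempfile.TemporaryDirectory() as tmp:
        write_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # run_name="__main__" makes the `if __name__ == "__main__"` guard fire.
            return runpy.run_module(PKG, run_name="__main__")
        finally:
            # Undo the path change and drop the cached demo modules.
            sys.path.remove(tmp)
            for name in [m for m in sys.modules if m == PKG or m.startswith(PKG + ".")]:
                del sys.modules[name]


if __name__ == "__main__":
    print(run_package_as_main()["RESULT"])
```

The `__init__.py` re-export serves programmatic callers, while `__main__.py` serves the `-m` invocation; both delegate to the same `run_app`, so there is a single start-up path.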
22 changes: 22 additions & 0 deletions py-src/data_formulator/agents/__init__.py
@@ -0,0 +1,22 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

from data_formulator.agents.agent_concept_derive import ConceptDeriveAgent
from data_formulator.agents.agent_py_concept_derive import PyConceptDeriveAgent
from data_formulator.agents.agent_data_transformation import DataTransformationAgent
from data_formulator.agents.agent_data_transform_v2 import DataTransformationAgentV2
from data_formulator.agents.agent_data_load import DataLoadAgent
from data_formulator.agents.agent_sort_data import SortDataAgent
from data_formulator.agents.agent_data_clean import DataCleanAgent
from data_formulator.agents.agent_data_rec import DataRecAgent

__all__ = [
    "ConceptDeriveAgent",
    "PyConceptDeriveAgent",
    "DataTransformationAgent",
    "DataTransformationAgentV2",
    "DataRecAgent",
    "DataLoadAgent",
    "SortDataAgent",
    "DataCleanAgent"
]
@@ -2,7 +2,7 @@
# Licensed under the MIT License.

import pandas as pd
- from agents.agent_utils import generate_data_summary, extract_code_from_gpt_response
+ from data_formulator.agents.agent_utils import generate_data_summary, extract_code_from_gpt_response

import logging

@@ -8,7 +8,7 @@
APP_ROOT = os.path.abspath('..')
sys.path.append(os.path.abspath(APP_ROOT))

- from agents.agent_utils import generate_data_summary, field_name_to_ts_variable_name, extract_code_from_gpt_response, infer_ts_datatype
+ from data_formulator.agents.agent_utils import generate_data_summary, field_name_to_ts_variable_name, extract_code_from_gpt_response, infer_ts_datatype

import logging

150 changes: 150 additions & 0 deletions py-src/data_formulator/agents/agent_data_clean.py
@@ -0,0 +1,150 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

import json
import pandas as pd

from data_formulator.agents.agent_utils import extract_json_objects, generate_data_summary, extract_code_from_gpt_response, field_name_to_ts_variable_name, infer_ts_datatype

import logging

logger = logging.getLogger(__name__)


SYSTEM_PROMPT = '''You are a data scientist to help user to generate or clean the raw input into a *csv block* (or tsv if that's the original format).
The output csv format should be readable into a python pandas dataframe directly.

Create [OUTPUT] based on [RAW DATA] provided. The output should have two components:

1. a csv codeblock that represents the cleaned data, as follows:

```csv
.....
```

2. a json object that explains the mode and cleaning rationale (wrap in a json block):

```json
{
"mode": ..., // one of "data generation" or "data cleaning" based on the provided task
"reason": ... // explain the cleaning reason here
}
```

**Important:**
- NEVER make assumptions or judgments about a person's gender, biological sex, sexuality, religion, race, nationality, ethnicity, political stance, socioeconomic status, mental health, invisible disabilities, medical conditions, personality type, social impressions, emotional state, and cognitive state.
- NEVER create formulas that could be used to discriminate based on age. Ageism of any form (explicit and implicit) is strictly prohibited.
- If above issue occurs, just copy the original data and return in the block

The cleaning process must follow instructions below:
* the output should be a structured csv table:
- if the raw data is unstructured, structure it into a csv table. If the table is in other formats, transform it into a csv table.
- if the raw data contains information other than the table, remove surrounding text that does not belong to the table.
- if the raw data contains multiple levels of header, make it a flat table. It's ok to combine multiple levels of headers to form the new header to not lose information.
- if the table has footer or summary row, remove them, since they would not be compatible with the csv table format.
- the csv table should have the same number of cells for each line, according to the title. If there are some rows with missing values, patch them with empty cells.
- if the raw data has some rows that do not belong to the table, also remove them (e.g., subtitles in between rows)
- if the header row misses some columns, add their corresponding column names. E.g., when the header doesn't have an index column, but every row has an index value, add the missing column header.
* clean up columns with messy information
- if a column is numeric but some cells have annotations like "*", "?" or brackets, clean them up.
- if a column is number but has units like ($, %, s), convert them to number (make sure unit conversion is correct when multiple units exist like minute and second) and include unit in the header.
- you don't need to convert format of the cell.
* if the user asks about generating synthetic data:
- NEVER generate data that has implicit bias as noted above; if that happens, return dummy data consisting of dummy columns with 'a, b, c' and numbers.
- NEVER generate data that contains people's names; use "A", "B", "C"... instead.
- If the user doesn't indicate how many rows to generate, plan on generating a dataset with 10-20 rows depending on the content.
'''



EXAMPLE = '''
[RAW DATA]

Rank NOC Gold Silver Bronze Total
1 South Korea 5 1 1 7
2 France* 0 1 1 2
United States 0 1 1 2
4 China 0 1 0 1
Germany 0 1 0 1
6 Mexico 0 0 1 1
Turkey 0 0 1 1
Totals (7 entries) 5 5 5 15

[OUTPUT]

'''

class DataCleanAgent(object):

    def __init__(self, client, model):
        self.model = model
        self.client = client

    def run(self, content_type, raw_data):
        """Clean raw text or image input into a csv block.

        Returns a list of candidate results, one per response choice.
        """

        if content_type == "text":
            user_prompt = {
                "role": "user",
                "content": [{
                    'type': 'text',
                    # use the same "[RAW DATA]" tag that SYSTEM_PROMPT refers to
                    'text': f"[RAW DATA]\n\n{raw_data}\n\n[OUTPUT]\n"
                }]
            }
        elif content_type == "image":
            user_prompt = {
                'role': 'user',
                'content': [
                    {
                        'type': 'text',
                        'text': '''[RAW DATA]\n\n'''
                    },
                    {
                        'type': 'image_url',
                        'image_url': {
                            "url": raw_data,
                            "detail": "high"
                        }
                    },
                    {
                        'type': 'text',
                        'text': '''[OUTPUT]\n\n'''
                    },
                ]
            }
        else:
            # without this branch, user_prompt would be unbound below
            return [{
                'status': 'other error',
                'content': f'unsupported content type: {content_type}',
                'dialog': [],
                'agent': 'DataCleanAgent'
            }]

        logger.info(user_prompt)

        system_message = {
            'role': 'system',
            'content': [{'type': 'text', 'text': SYSTEM_PROMPT}]
        }

        messages = [system_message, user_prompt]

        ###### the part that calls open_ai
        response = self.client.chat.completions.create(
            model=self.model, messages=messages, temperature=0.7, max_tokens=1200,
            top_p=0.95, n=1, frequency_penalty=0, presence_penalty=0, stop=None)

        candidates = []
        for choice in response.choices:

            logger.info("\n=== Python Data Clean Agent ===>\n")
            logger.info(choice.message.content + "\n")

            # the csv block carries the cleaned table, the json block the rationale
            code_blocks = extract_code_from_gpt_response(choice.message.content + "\n", "csv")
            reason_blocks = extract_json_objects(choice.message.content + "\n")

            if len(code_blocks) > 0:
                result = {
                    'status': 'ok',
                    'content': code_blocks[-1],
                    'info': reason_blocks[-1] if len(reason_blocks) > 0 else {"reason": "no reason presented", "mode": "data cleaning"}
                }
            else:
                result = {'status': 'other error', 'content': 'unable to extract code from response'}

            result['dialog'] = [*messages, {"role": choice.message.role, "content": choice.message.content}]
            result['agent'] = 'DataCleanAgent'
            candidates.append(result)

        return candidates
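The agent above relies on `extract_code_from_gpt_response` and `extract_json_objects` from `agent_utils`, which this diff does not show. As a rough illustration of the post-processing step, the sketch below parses a reply in the format that `SYSTEM_PROMPT` requests (a csv block followed by a json block). Everything here is an assumption made for illustration: `extract_fenced_blocks` and `parse_clean_reply` are hypothetical helpers, the regex approach may differ from the real utilities, and `SAMPLE_REPLY` is a hand-written reply, not actual model output.

```python
import csv
import io
import json
import re

FENCE = "`" * 3  # built at runtime so the example contains no literal fence


def extract_fenced_blocks(text: str, lang: str) -> list:
    """Return the bodies of all fenced blocks tagged `lang` (hypothetical helper)."""
    pattern = re.compile(FENCE + re.escape(lang) + r"[ \t]*\n(.*?)" + FENCE, re.DOTALL)
    return [body.strip() for body in pattern.findall(text)]


def parse_clean_reply(text: str) -> dict:
    """Mirror the result shape used by the agent: status, content, info."""
    csv_blocks = extract_fenced_blocks(text, "csv")
    if not csv_blocks:
        return {"status": "other error", "content": "unable to extract code from response"}

    # Same fallback the agent uses when no rationale can be recovered.
    info = {"reason": "no reason presented", "mode": "data cleaning"}
    json_blocks = extract_fenced_blocks(text, "json")
    if json_blocks:
        try:
            info = json.loads(json_blocks[-1])
        except json.JSONDecodeError:
            pass  # keep the fallback if the block is not valid JSON

    # Parse the last csv block into rows to check that it is a well-formed table.
    rows = list(csv.reader(io.StringIO(csv_blocks[-1])))
    return {"status": "ok", "content": csv_blocks[-1], "rows": rows, "info": info}


# Hand-written reply for the first rows of the medal table in EXAMPLE.
SAMPLE_REPLY = "\n".join([
    FENCE + "csv",
    "Rank,NOC,Gold,Silver,Bronze,Total",
    "1,South Korea,5,1,1,7",
    "2,France,0,1,1,2",
    ",United States,0,1,1,2",
    FENCE,
    "",
    FENCE + "json",
    '{"mode": "data cleaning", "reason": "removed the footnote marker and the totals row"}',
    FENCE,
])

if __name__ == "__main__":
    result = parse_clean_reply(SAMPLE_REPLY)
    print(result["status"], len(result["rows"]), result["info"]["mode"])
```

Taking the last block of each kind follows the agent's `code_blocks[-1]` and `reason_blocks[-1]` convention. The sample leaves the missing rank as an empty cell, in line with the prompt's rule to patch missing values with empty cells.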
@@ -3,8 +3,8 @@

import json

- from agents.agent_utils import generate_data_summary, extract_code_from_gpt_response
- import py_sandbox
+ from data_formulator.agents.agent_utils import generate_data_summary, extract_code_from_gpt_response
+ import data_formulator.py_sandbox as py_sandbox

import logging
