microsoft · Chenglong-MS · Oct 11, 2024 · Sep 24, 2024 · Sep 24, 2024 · Oct 3, 2024
diff --git a/.devcontainer/devcontainer.json b/.devcontainer/devcontainer.json
@@ -16,7 +16,7 @@
 	// "forwardPorts": [],
 
 	// Use 'postCreateCommand' to run commands after the container is created.
-	"postCreateCommand": "python3 -m venv /workspaces/data-formulator/venv && . /workspaces/data-formulator/venv/bin/activate && pip install -r /workspaces/data-formulator/requirements.txt --verbose && yarn install && yarn build"
+	"postCreateCommand": "python3 -m venv /workspaces/data-formulator/venv && . /workspaces/data-formulator/venv/bin/activate && pip install https://github.com/user-attachments/files/17319752/data_formulator-0.1.0.tar.gz --verbose && data_formulator"
 
 	// Configure tool-specific properties.
 	// "customizations": {},

diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,6 @@
+
+
+*openai-keys.env 
 **/*.ipynb_checkpoints/
 
 .DS_Store

diff --git a/CODESPACES.md b/CODESPACES.md
@@ -15,12 +15,11 @@ You will need a GitHub account and to be logged in to use Codespaces.
 ### Step 2: Run the app
 The codespace is a VSCode development environment in the cloud. Once the Codespace is created, start Data Formulator with the following steps:
 
-* Press **F5** to run. Or if you prefer, click the **Run and Debug** tab on the left, and the **Start Debugging** button.
 * A toast about port forwarding will appear, click the **Open in Browser** button.
 * You will see the Data Formulator app!
 
 <kbd>
-  <img width="528" alt="image" src="https://github.com/user-attachments/assets/e62bebda-8daf-4587-94d4-fede48de382b">
+  <img width="528" alt="image" src="https://github.com/user-attachments/assets/cb9e2123-4a42-4926-8b59-5bafb9be25fa">
 </kbd>
 
 

diff --git a/MANIFEST.IN b/MANIFEST.IN
@@ -0,0 +1,2 @@
+include py-src/data_formulator/dist/*
+include py-src/data_formulator/dist/assets/*
diff --git a/README.md b/README.md
@@ -6,13 +6,25 @@
 
 [![arxiv](https://img.shields.io/badge/Paper-arXiv:2408.16119-b31b1b.svg)](https://arxiv.org/abs/2408.16119)&ensp;
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)&ensp;
+[![YouTube](https://img.shields.io/badge/YouTube-white?logo=youtube&logoColor=%23FF0000)](https://youtu.be/3ndlwt0Wi3c)&ensp;
 
 </div>
 
 Transform data and create rich visualizations iteratively with AI 🪄. Try Data Formulator now in GitHub Codespaces!
 
 [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/microsoft/data-formulator?quickstart=1)
 
+## News 🔥
+
+- [10-09-2024] Data Formulator Python package released!
+  - You can now install Data Formulator using Python and run it locally, easily. [[check it out]](#get-started).
+  - Our Codespace configuration is also updated for fast start up ⚡️. [[try it now!]](https://codespaces.new/microsoft/data-formulator?quickstart=1)
+
+- [10-09-2024] New experimental feature release: 
+  - Load an image or a messy data snippet into Data Formulator, and AI will parse and clean it for you.
+
+- [10-01-2024] Initial release of Data Formulator. Check out our [blog](https://www.microsoft.com/en-us/research/blog/data-formulator-exploring-how-ai-can-help-analysts-create-rich-data-visualizations/) and [video](https://youtu.be/3ndlwt0Wi3c)!
+
 
 <kbd>
   <a target="_blank" rel="noopener noreferrer" href="https://codespaces.new/microsoft/data-formulator?quickstart=1" title="open Data Formulator in GitHub Codespaces"><img src="public/data-formulator-screenshot.png"></a>
@@ -22,27 +34,32 @@ Transform data and create rich visualizations iteratively with AI 🪄. Try Data
 
 **Data Formulator** is an application from Microsoft Research that uses large language models to transform data, expediting the practice of data visualization.
 
-To create rich visualizations, data analysts often need to iterate back and forth among data processing and chart specification to achieve their goals. To achieve this, analysts need proficiency in data transformation and visualization tools, and they also spend effort managing the iteration history. This can be challenging!
+Data Formulator is an AI-powered tool for analysts to iteratively create rich visualizations. Unlike most chat-based AI tools where users need to describe everything in natural language, Data Formulator combines *user interface interactions (UI)* and *natural language (NL) inputs* for easier interaction. This blended approach makes it easier for users to describe their chart designs while delegating data transformation to AI. 
 
-Data Formulator is an AI-powered tool for analysts to iteratively create rich visualizations. Unlike most chat-based AI tools where users need to describe everything in natural language, Data Formulator combines user interface interactions (UI) with natural language (NL) inputs. This blended approach makes it easier for users to describe their chart designs while delegating data transformation to AI. 
+## Get Started
 
-Check out these cool Data Formulator features that can help you create impressive visualizations!
-* Using the **blended UI and NL inputs** to describe the chart. 
-* Utilizing **data threads** to navigate the history and reuse previous results to create new ones instead of starting from scratch every time.
+Play with Data Formulator with one of the following options:
 
-## Get Started
+- **Option 1: Install via pip**
+
+  Use pip for an easy local setup.
+
+  ```
+  >> pip install data_formulator
+  >> data_formulator
+  ```
 
-Choose one of the following options to set up Data Formulator:
+  Data Formulator will be automatically opened in the browser at [http://localhost:5000](http://localhost:5000).
 
-- **Option 1: Codespaces**
+- **Option 2: Codespaces (5 minutes)**
 
-  Use Codespaces for an easy setup experience, as everything is preconfigured to get you up and running quickly. For more details, see [CODESPACES.md](CODESPACES.md).
+  You can also run Data Formulator in Codespaces, where everything is preconfigured. For more details, see [CODESPACES.md](CODESPACES.md).
 
   [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/microsoft/data-formulator?quickstart=1)
 
-- **Option 2: Local Installation**
+- **Option 3: Working in developer mode**
 
-  Opt for a local installation if you prefer full control over your development environment and the ability to customize the setup to your specific needs. For detailed instructions, refer to [DEVELOPMENT.md](DEVELOPMENT.md).
+  You can build Data Formulator locally if you prefer full control over your development environment and the ability to customize the setup to your specific needs. For detailed instructions, refer to [DEVELOPMENT.md](DEVELOPMENT.md).
 
 
 ## Using Data Formulator

diff --git a/local_server.bat b/local_server.bat
@@ -2,7 +2,7 @@
 :: Licensed under the MIT License.
 
 @echo off
-set FLASK_APP=app.py
+set FLASK_APP=py-src/data_formulator/app.py
 set FLASK_RUN_PORT=5000
 set FLASK_RUN_HOST=0.0.0.0
 flask run
diff --git a/local_server.sh b/local_server.sh
@@ -1,4 +1,4 @@
 # Copyright (c) Microsoft Corporation.
 # Licensed under the MIT License.
 
-env FLASK_APP=app.py FLASK_RUN_PORT=5000 FLASK_RUN_HOST=0.0.0.0 flask run
+env FLASK_APP=py-src/data_formulator/app.py FLASK_RUN_PORT=5000 FLASK_RUN_HOST=0.0.0.0 flask run
diff --git a/package.json b/package.json
@@ -24,6 +24,7 @@
         "react": "^18.2.0",
         "react-animate-height": "^3.0.4",
         "react-animate-on-change": "^2.2.0",
+        "react-diff-viewer": "^3.1.1",
         "react-dnd": "^16.0.1",
         "react-dnd-html5-backend": "^16.0.1",
         "react-dom": "^18.2.0",
@@ -40,7 +41,8 @@
         "vega": "^5.23.0",
         "vega-embed": "^6.21.0",
         "vega-lite": "^5.5.0",
-        "vm-browserify": "^1.1.2"
+        "vm-browserify": "^1.1.2",
+        "validator": "^13.12.0"
     },
     "scripts": {
         "start": "vite",

diff --git a/py-src/data_formulator/__init__.py b/py-src/data_formulator/__init__.py
@@ -0,0 +1,5 @@
+from .app import run_app
+
+__all__ = [
+    "run_app",
+]
diff --git a/py-src/data_formulator/__main__.py b/py-src/data_formulator/__main__.py
@@ -0,0 +1,4 @@
+from .app import run_app
+
+if __name__ == "__main__":
+    run_app()
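
Review note: `run_app` itself is not part of the hunks shown here, but the README change says the app opens in the browser at `http://localhost:5000` after launch. One common way to get that behaviour is to schedule the browser launch on a timer just before the server starts blocking. The sketch below is only an illustration under that assumption; `schedule_browser` and its parameters are hypothetical names, not code from this PR.

```python
import threading
import webbrowser

DEFAULT_PORT = 5000  # the port named in the README and local_server scripts


def schedule_browser(port=DEFAULT_PORT, delay=1.0, opener=webbrowser.open):
    """Call `opener` with the app URL after `delay` seconds.

    Hypothetical helper: a real entry point would call this right before
    starting the (blocking) web server. Returns the started timer.
    """
    url = f"http://localhost:{port}"
    timer = threading.Timer(delay, opener, args=(url,))
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer


# Demonstration with a stand-in opener so that no real browser is launched.
opened = []
timer = schedule_browser(delay=0, opener=opened.append)
timer.join()
print(opened)  # ['http://localhost:5000']
```

Passing the opener as a parameter keeps the helper testable without touching a real browser.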
diff --git a/py-src/data_formulator/agents/__init__.py b/py-src/data_formulator/agents/__init__.py
@@ -0,0 +1,22 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+from data_formulator.agents.agent_concept_derive import ConceptDeriveAgent
+from data_formulator.agents.agent_py_concept_derive import PyConceptDeriveAgent
+from data_formulator.agents.agent_data_transformation import DataTransformationAgent
+from data_formulator.agents.agent_data_transform_v2 import DataTransformationAgentV2
+from data_formulator.agents.agent_data_load import DataLoadAgent
+from data_formulator.agents.agent_sort_data import SortDataAgent
+from data_formulator.agents.agent_data_clean import DataCleanAgent
+from data_formulator.agents.agent_data_rec import DataRecAgent
+
+__all__ = [
+    "ConceptDeriveAgent",
+    "PyConceptDeriveAgent",
+    "DataTransformationAgent",
+    "DataTransformationAgentV2",
+    "DataRecAgent",
+    "DataLoadAgent",
+    "SortDataAgent",
+    "DataCleanAgent"
+]
diff --git a/server/agents/agent_code_explanation.py → ...rmulator/agents/agent_code_explanation.py b/server/agents/agent_code_explanation.py → ...rmulator/agents/agent_code_explanation.py
@@ -2,7 +2,7 @@
 # Licensed under the MIT License.
 
 import pandas as pd
-from agents.agent_utils import generate_data_summary, extract_code_from_gpt_response
+from data_formulator.agents.agent_utils import generate_data_summary, extract_code_from_gpt_response
 
 import logging
 

diff --git a/server/agents/agent_concept_derive.py → ...formulator/agents/agent_concept_derive.py b/server/agents/agent_concept_derive.py → ...formulator/agents/agent_concept_derive.py
@@ -8,7 +8,7 @@
 APP_ROOT = os.path.abspath('..')
 sys.path.append(os.path.abspath(APP_ROOT))
 
-from agents.agent_utils import generate_data_summary, field_name_to_ts_variable_name, extract_code_from_gpt_response, infer_ts_datatype
+from data_formulator.agents.agent_utils import generate_data_summary, field_name_to_ts_variable_name, extract_code_from_gpt_response, infer_ts_datatype
 
 import logging
 

diff --git a/py-src/data_formulator/agents/agent_data_clean.py b/py-src/data_formulator/agents/agent_data_clean.py
@@ -0,0 +1,150 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+import json
+import pandas as pd
+
+from data_formulator.agents.agent_utils import extract_json_objects, generate_data_summary, extract_code_from_gpt_response, field_name_to_ts_variable_name, infer_ts_datatype
+
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+SYSTEM_PROMPT = '''You are a data scientist who helps the user generate or clean raw input into a *csv block* (or tsv if that's the original format).
+The output csv format should be readable into a python pandas dataframe directly.
+
+Create [OUTPUT] based on [RAW DATA] provided. The output should have two components:
+
+1. a csv codeblock that represents the cleaned data, as follows:
+
+```csv
+.....
+```
+
+2. a json object that explains the mode and cleaning rationale (wrap in a json block):
+
+```json
+{
+    "mode": ..., // one of "data generation" or "data cleaning" based on the provided task
+    "reason": ... // explain the cleaning reason here
+}
+```
+
+**Important:**
+- NEVER make assumptions or judgments about a person's gender, biological sex, sexuality, religion, race, nationality, ethnicity, political stance, socioeconomic status, mental health, invisible disabilities, medical conditions, personality type, social impressions, emotional state, and cognitive state.
+- NEVER create formulas that could be used to discriminate based on age. Ageism of any form (explicit and implicit) is strictly prohibited.
+- If any of the above issues occurs, just copy the original data and return it in the csv block
+
+The cleaning process must follow instructions below:
+* the output should be a structured csv table: 
+    - if the raw data is unstructured, structure it into a csv table. If the table is in other formats, transform it into a csv table.
+    - if the raw data contains information other than the table, remove surrounding text that does not belong to the table.
+    - if the raw data contains multiple levels of header, make it a flat table. It's ok to combine multiple levels of headers to form the new header to not lose information.
+    - if the table has footer or summary row, remove them, since they would not be compatible with the csv table format.
+    - the csv table should have the same number of cells for each line, according to the title. If there are some rows with missing values, patch them with empty cells.
+    - if the raw data has some rows that do not belong to the table, also remove them (e.g., subtitles in between rows) 
+    - if the header row misses some columns, add their corresponding column names. E.g., when the header doesn't have an index column, but every row has an index value, add the missing column header.
+* clean up columns with messy information
+    - if a column is numeric but some cells have annotations like "*" "?" or brackets, clean them up.
+    - if a column is numeric but has units like ($, %, s), convert the values to numbers (make sure unit conversion is correct when multiple units exist like minute and second) and include the unit in the header.
+    - you don't need to convert format of the cell.
+* if the user asks about generating synthetic data:
+    - NEVER generate data that has implicit bias as noted above; if that happens, return dummy data consisting of dummy columns with 'a, b, c' and numbers.
+    - NEVER generate data that contains people's names; use "A", "B", "C"... instead.
+    - If the user doesn't indicate how many rows to generate, plan on generating a dataset with 10-20 rows depending on the content.
+'''
+
+
+
+EXAMPLE = '''
+[RAW DATA]
+
+Rank	NOC	Gold	Silver	Bronze	Total
+1	 South Korea	5	1	1	7
+2	 France*	0	1	1	2
+ United States	0	1	1	2
+4	 China	0	1	0	1
+ Germany	0	1	0	1
+6	 Mexico	0	0	1	1
+ Turkey	0	0	1	1
+Totals (7 entries)	5	5	5	15
+
+[OUTPUT]
+
+'''
+
+class DataCleanAgent(object):
+
+    def __init__(self, client, model):
+        self.model = model
+        self.client = client
+
+    def run(self, content_type, raw_data):
+        """clean raw text or image input into a csv table (or generate one on request)
+        """
+
+        if content_type == "text":
+            user_prompt = {
+                "role": "user",
+                "content": [{
+                    'type': 'text',
+                    'text': f"[RAW DATA]\n\n{raw_data}\n\n[OUTPUT]\n"
+                }]
+            }
+        elif content_type == "image":
+            user_prompt = {
+                'role': 'user',
+                'content': [ {
+                    'type': 'text',
+                    'text': '''[RAW DATA]\n\n'''},
+                    {
+                        'type': 'image_url',
+                        'image_url': {
+                            "url": raw_data,
+                            "detail": "high"
+                        }
+                    },
+                    {
+                        'type': 'text',
+                        'text': '''[OUTPUT]\n\n'''
+                    }, 
+                ]
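+            }
+        else:
+            # fail early instead of hitting an unbound `user_prompt` below
+            raise ValueError(f"unsupported content_type: {content_type}")
+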
+        logger.info(user_prompt)
+
+        system_message = {
+            'role': 'system',
+            'content': [ {'type': 'text', 'text': SYSTEM_PROMPT}]}
+
+        messages = [system_message, user_prompt]
+
+        ###### the part that calls open_ai
+        response = self.client.chat.completions.create(
+            model=self.model, messages = messages, temperature=0.7, max_tokens=1200,
+            top_p=0.95, n=1, frequency_penalty=0, presence_penalty=0, stop=None)
+
+        candidates = []
+        for choice in response.choices:
+
+            logger.info("\n=== Python Data Clean Agent ===>\n")
+            logger.info(choice.message.content + "\n")
+
+            code_blocks = extract_code_from_gpt_response(choice.message.content + "\n", "csv")
+            reason_blocks = extract_json_objects(choice.message.content + "\n")
+
+            if len(code_blocks) > 0:
+                result = {
+                    'status': 'ok', 
+                    'content': code_blocks[-1], 
+                    'info': reason_blocks[-1] if len(reason_blocks) > 0 else {"reason": "no reason presented", "mode": "data cleaning"}
+                }
+            else:
+                result = {'status': 'other error', 'content': 'unable to extract code from response'}
+
+            result['dialog'] = [*messages, {"role": choice.message.role, "content": choice.message.content}]
+            result['agent'] = 'DataCleanAgent'
+            candidates.append(result)
+
+        return candidates
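
Review note: `run` depends on `extract_code_from_gpt_response` and `extract_json_objects` from `agent_utils`, neither of which appears in this diff. The sketch below shows roughly what the parsing step amounts to, assuming the model reply follows the two-block format requested in `SYSTEM_PROMPT`. `fenced_blocks`, `split_reply`, and the sample reply are hypothetical stand-ins for illustration, not the project's helpers, and the standard-library `csv` module is used in place of pandas to keep the example dependency-free.

```python
import csv
import io
import json
import re

FENCE = "`" * 3  # built at runtime so this example needs no nested fences


def fenced_blocks(text, lang):
    """Return the bodies of all fenced blocks tagged with `lang`."""
    pattern = re.compile(FENCE + re.escape(lang) + r"\s*\n(.*?)" + FENCE, re.DOTALL)
    return [body.strip() for body in pattern.findall(text)]


def split_reply(reply):
    """Split a reply into the cleaned csv text and the explanation object."""
    csv_blocks = fenced_blocks(reply, "csv")
    if not csv_blocks:
        return {"status": "other error",
                "content": "unable to extract code from response"}
    # Same fallback as in the agent when no usable json block is present.
    info = {"reason": "no reason presented", "mode": "data cleaning"}
    for body in fenced_blocks(reply, "json"):
        try:
            info = json.loads(body)
        except json.JSONDecodeError:
            pass  # e.g. the model echoed the `//` comments from the prompt
    return {"status": "ok", "content": csv_blocks[-1], "info": info}


# Hypothetical model reply in the requested format.
reply = "\n".join([
    FENCE + "csv",
    "Rank,NOC,Gold",
    "1,South Korea,5",
    "2,France,0",
    FENCE,
    FENCE + "json",
    '{"mode": "data cleaning", "reason": "removed footnote marker"}',
    FENCE,
])

result = split_reply(reply)
rows = list(csv.DictReader(io.StringIO(result["content"])))
print(result["status"], rows[1]["NOC"])  # ok France
```

Taking the last csv block mirrors `code_blocks[-1]` in the agent, which favours the model's final answer if it emits more than one block.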
diff --git a/server/agents/agent_data_filter.py → ...ta_formulator/agents/agent_data_filter.py b/server/agents/agent_data_filter.py → ...ta_formulator/agents/agent_data_filter.py
@@ -3,8 +3,8 @@
 
 import json
 
-from agents.agent_utils import generate_data_summary, extract_code_from_gpt_response
-import py_sandbox
+from data_formulator.agents.agent_utils import generate_data_summary, extract_code_from_gpt_response
+import data_formulator.py_sandbox as py_sandbox
 
 import logging