Py packaging #31
Merged
Changes from 16 commits
41 commits
b4016ef
updates to enable smarter data load
Chenglong-MS b663a50
Merge branch 'main' into dev
Chenglong-MS 24ba6bd
experimental data cleaning on load function
Chenglong-MS ff6b585
wip
Chenglong-MS ea0c562
some fixes
Chenglong-MS 5084e3a
supporting image uploads as inputs
Chenglong-MS d42a91e
preparing for pip release
Chenglong-MS 2b74c7c
pip install from tar
danmarshall 89af5c7
auto-launch
danmarshall 3ab42e3
remove "f5"
danmarshall 38c1a31
static image for codespace
danmarshall 7b19a06
some clean up
Chenglong-MS 4d9e7e3
update readme
Chenglong-MS 5bb9c9e
cleaning up text
Chenglong-MS bdfda8a
merge diff
Chenglong-MS 7d142f6
Fix code scanning alert no. 3: DOM text reinterpreted as HTML
Chenglong-MS 5c0e286
Fix code scanning alert no. 6: DOM text reinterpreted as HTML
Chenglong-MS 1ad4bcd
update to readme
Chenglong-MS a1dcfca
README change
Chenglong-MS d4d8d2f
add workflow
Chenglong-MS 9e89425
update build
Chenglong-MS bf83d56
update build script
Chenglong-MS f85df75
update build script
Chenglong-MS 0e052d8
update build script
Chenglong-MS c2fd657
fix typo in workflow
Chenglong-MS f45b5dd
try new install order
Chenglong-MS f8ce5a0
check
Chenglong-MS f9bc4a4
try include package information
Chenglong-MS 52329d4
fix
Chenglong-MS cd55ce0
test again..
Chenglong-MS 1c25fd9
try luck
Chenglong-MS b16a78a
try luck
Chenglong-MS 363adcb
try luck
Chenglong-MS 0a0d6b8
update pyproject
Chenglong-MS c729658
update manifest
Chenglong-MS d75c8dd
update build flow
Chenglong-MS 7b27661
update upload script
Chenglong-MS fbb965a
try fix build
Chenglong-MS 7fff58b
update publish scripts
Chenglong-MS 2330fa5
prep
Chenglong-MS e623830
update readme
Chenglong-MS
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,6 @@ | ||
|
|
||
|
|
||
| *openai-keys.env | ||
| **/*.ipynb_checkpoints/ | ||
|
|
||
| .DS_Store | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| include py-src/data_formulator/dist/* | ||
| include py-src/data_formulator/dist/assets/* |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,4 @@ | ||
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT License. | ||
|
|
||
| -env FLASK_APP=app.py FLASK_RUN_PORT=5000 FLASK_RUN_HOST=0.0.0.0 flask run | ||
| +env FLASK_APP=py-src/data_formulator/app.py FLASK_RUN_PORT=5000 FLASK_RUN_HOST=0.0.0.0 flask run |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| from .app import run_app | ||
|
|
||
| __all__ = [ | ||
| "run_app", | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| from .app import run_app | ||
|
|
||
| if __name__ == "__main__": | ||
| run_app() |
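
The two small files above follow the usual `__init__.py` / `__main__.py` split: the first re-exports `run_app`, the second calls it when the package is executed as a script (for example with `python -m`). The sketch below reproduces that layout in a throwaway package to show how the entry point is reached. The package name `demo_pkg_entrypoint` and the body of `run_app` are placeholders invented for illustration; they are not taken from this pull request.

```python
import contextlib
import io
import runpy
import sys
import tempfile
from pathlib import Path

# Hypothetical package name, chosen so it cannot clash with the real project.
PKG = "demo_pkg_entrypoint"

# Same three-file layout as in the diff; run_app here is only a stand-in.
FILES = {
    "app.py": "def run_app():\n    print('app started')\n",
    "__init__.py": "from .app import run_app\n\n__all__ = ['run_app']\n",
    "__main__.py": (
        "from .app import run_app\n\n"
        "if __name__ == '__main__':\n"
        "    run_app()\n"
    ),
}


def run_package_as_script() -> str:
    """Write the package to a temp dir, run it like `python -m`, return stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / PKG
        pkg_dir.mkdir()
        for name, source in FILES.items():
            (pkg_dir / name).write_text(source)

        sys.path.insert(0, tmp)
        try:
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                # For a package, runpy executes its __main__ submodule.
                runpy.run_module(PKG, run_name="__main__")
            return buffer.getvalue()
        finally:
            sys.path.remove(tmp)
            # Forget the throwaway modules so repeated runs start clean.
            for mod in [m for m in sys.modules if m == PKG or m.startswith(PKG + ".")]:
                del sys.modules[mod]


if __name__ == "__main__":
    print(run_package_as_script())
```

Because `__main__.py` uses a relative import, it only works when run as part of the package, which is what `runpy.run_module` (and `python -m`) arrange by setting `__package__`.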
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT License. | ||
|
|
||
| from data_formulator.agents.agent_concept_derive import ConceptDeriveAgent | ||
| from data_formulator.agents.agent_py_concept_derive import PyConceptDeriveAgent | ||
| from data_formulator.agents.agent_data_transformation import DataTransformationAgent | ||
| from data_formulator.agents.agent_data_transform_v2 import DataTransformationAgentV2 | ||
| from data_formulator.agents.agent_data_load import DataLoadAgent | ||
| from data_formulator.agents.agent_sort_data import SortDataAgent | ||
| from data_formulator.agents.agent_data_clean import DataCleanAgent | ||
| from data_formulator.agents.agent_data_rec import DataRecAgent | ||
|
|
||
| __all__ = [ | ||
| "ConceptDeriveAgent", | ||
| "PyConceptDeriveAgent", | ||
| "DataTransformationAgent", | ||
| "DataTransformationAgentV2", | ||
| "DataRecAgent", | ||
| "DataLoadAgent", | ||
| "SortDataAgent", | ||
| "DataCleanAgent" | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,150 @@ | ||
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT License. | ||
|
|
||
| import json | ||
| import pandas as pd | ||
|
|
||
| from data_formulator.agents.agent_utils import extract_json_objects, generate_data_summary, extract_code_from_gpt_response, field_name_to_ts_variable_name, infer_ts_datatype | ||
|
|
||
| import logging | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| SYSTEM_PROMPT = '''You are a data scientist who helps the user generate or clean raw input into a *csv block* (or tsv if that's the original format). | ||
| The output csv format should be readable into a python pandas dataframe directly. | ||
|
|
||
| Create [OUTPUT] based on [RAW DATA] provided. The output should have two components: | ||
|
|
||
| 1. a csv codeblock that represents the cleaned data, as follows: | ||
|
|
||
| ```csv | ||
| ..... | ||
| ``` | ||
|
|
||
| 2. a json object that explains the mode and cleaning rationale (wrap in a json block): | ||
|
|
||
| ```json | ||
| { | ||
| "mode": ..., // one of "data generation" or "data cleaning" based on the provided task | ||
| "reason": ... // explain the cleaning reason here | ||
| } | ||
| ``` | ||
|
|
||
| **Important:** | ||
| - NEVER make assumptions or judgments about a person's gender, biological sex, sexuality, religion, race, nationality, ethnicity, political stance, socioeconomic status, mental health, invisible disabilities, medical conditions, personality type, social impressions, emotional state, and cognitive state. | ||
| - NEVER create formulas that could be used to discriminate based on age. Ageism of any form (explicit and implicit) is strictly prohibited. | ||
| - If any of the above issues occurs, just copy the original data and return it in the csv block | ||
|
|
||
| The cleaning process must follow instructions below: | ||
| * the output should be a structured csv table: | ||
| - if the raw data is unstructured, structure it into a csv table. If the table is in other formats, transform it into a csv table. | ||
| - if the raw data contains information other than the table, remove surrounding text that does not belong to the table. | ||
| - if the raw data contains multiple levels of header, make it a flat table. It's ok to combine multiple levels of headers to form the new header to not lose information. | ||
| - if the table has footer or summary rows, remove them, since they would not be compatible with the csv table format. | ||
| - the csv table should have the same number of cells for each line, according to the title. If there are some rows with missing values, patch them with empty cells. | ||
| - if the raw data has some rows that do not belong to the table, also remove them (e.g., subtitles in between rows) | ||
| - if the header row misses some columns, add their corresponding column names. E.g., when the header doesn't have an index column, but every row has an index value, add the missing column header. | ||
| * clean up columns with messy information | ||
| - if a column is numeric but some cells have annotations like "*", "?" or brackets, clean them up. | ||
| - if a column is numeric but has units like ($, %, s), convert the values to numbers (make sure unit conversion is correct when multiple units exist like minute and second) and include the unit in the header. | ||
| - you don't need to convert the format of the cells. | ||
| * if the user asks about generating synthetic data: | ||
| - NEVER generate data that has implicit bias as noted above; if that happens, return a dummy dataset consisting of dummy columns with 'a, b, c' and numbers. | ||
| - NEVER generate data that contains people's names; use "A", "B", "C"... instead. | ||
| - If the user doesn't indicate how many rows to generate, plan on generating a dataset with 10-20 rows depending on the content. | ||
| ''' | ||
|
|
||
|
|
||
|
|
||
| EXAMPLE = ''' | ||
| [RAW DATA] | ||
|
|
||
| Rank NOC Gold Silver Bronze Total | ||
| 1 South Korea 5 1 1 7 | ||
| 2 France* 0 1 1 2 | ||
| United States 0 1 1 2 | ||
| 4 China 0 1 0 1 | ||
| Germany 0 1 0 1 | ||
| 6 Mexico 0 0 1 1 | ||
| Turkey 0 0 1 1 | ||
| Totals (7 entries) 5 5 5 15 | ||
|
|
||
| [OUTPUT] | ||
|
|
||
| ''' | ||
|
|
||
| class DataCleanAgent(object): | ||
|
|
||
| def __init__(self, client, model): | ||
| self.model = model | ||
| self.client = client | ||
|
|
||
| def run(self, content_type, raw_data): | ||
| """clean raw text or image input into a csv table (or generate data when the user asks for it) | ||
| """ | ||
|
|
||
| if content_type == "text": | ||
| user_prompt = { | ||
| "role": "user", | ||
| "content": [{ | ||
| 'type': 'text', | ||
| 'text': f"[RAW DATA]\n\n{raw_data}\n\n[OUTPUT]\n" | ||
| }] | ||
| } | ||
| elif content_type == "image": | ||
| user_prompt = { | ||
| 'role': 'user', | ||
| 'content': [ { | ||
| 'type': 'text', | ||
| 'text': '''[RAW DATA]\n\n'''}, | ||
| { | ||
| 'type': 'image_url', | ||
| 'image_url': { | ||
| "url": raw_data, | ||
| "detail": "high" | ||
| } | ||
| }, | ||
| { | ||
| 'type': 'text', | ||
| 'text': '''[OUTPUT]\n\n''' | ||
| }, | ||
| ] | ||
| } | ||
| else: | ||
| raise ValueError(f"unsupported content_type: {content_type!r}") | ||
|
|
||
| logger.info(user_prompt) | ||
|
|
||
| system_message = { | ||
| 'role': 'system', | ||
| 'content': [ {'type': 'text', 'text': SYSTEM_PROMPT}]} | ||
|
|
||
| messages = [system_message, user_prompt] | ||
|
|
||
| ###### the part that calls open_ai | ||
| response = self.client.chat.completions.create( | ||
| model=self.model, messages = messages, temperature=0.7, max_tokens=1200, | ||
| top_p=0.95, n=1, frequency_penalty=0, presence_penalty=0, stop=None) | ||
|
|
||
| candidates = [] | ||
| for choice in response.choices: | ||
|
|
||
| logger.info("\n=== Python Data Clean Agent ===>\n") | ||
| logger.info(choice.message.content + "\n") | ||
|
|
||
| code_blocks = extract_code_from_gpt_response(choice.message.content + "\n", "csv") | ||
| reason_blocks = extract_json_objects(choice.message.content + "\n") | ||
|
|
||
| if len(code_blocks) > 0: | ||
| result = { | ||
| 'status': 'ok', | ||
| 'content': code_blocks[-1], | ||
| 'info': reason_blocks[-1] if len(reason_blocks) > 0 else {"reason": "no reason presented", "mode": "data cleaning"} | ||
| } | ||
| else: | ||
| result = {'status': 'other error', 'content': 'unable to extract code from response'} | ||
|
|
||
| result['dialog'] = [*messages, {"role": choice.message.role, "content": choice.message.content}] | ||
| result['agent'] = 'DataCleanAgent' | ||
| candidates.append(result) | ||
|
|
||
| return candidates |
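
The `run` method above expects each model reply to contain a fenced `csv` block and a fenced `json` block, and turns them into a result dict with `status`, `content` and `info`. The sketch below shows one way that post-processing could work using only the standard library. It is not the project's implementation: `extract_fenced_blocks`, `parse_reply` and `csv_rows` are hypothetical stand-ins for the `agent_utils` helpers (whose code is not shown in this diff), and `SAMPLE_REPLY` is an invented reply used purely for illustration.

```python
import csv
import io
import json
import re

# Built programmatically so the fence marker does not clash with this block.
FENCE = "`" * 3


def extract_fenced_blocks(text: str, lang: str) -> list[str]:
    """Return the bodies of all fenced blocks tagged with `lang`."""
    pattern = re.compile(
        re.escape(FENCE + lang) + r"[ \t]*\n(.*?)" + re.escape(FENCE),
        re.DOTALL,
    )
    return [body.strip() for body in pattern.findall(text)]


def parse_reply(reply: str) -> dict:
    """Build a result dict shaped like the one in DataCleanAgent.run."""
    csv_blocks = extract_fenced_blocks(reply, "csv")
    json_blocks = extract_fenced_blocks(reply, "json")
    if not csv_blocks:
        return {"status": "other error",
                "content": "unable to extract code from response"}

    default_info = {"reason": "no reason presented", "mode": "data cleaning"}
    try:
        info = json.loads(json_blocks[-1]) if json_blocks else default_info
    except json.JSONDecodeError:
        # Assumption: fall back to the default when the JSON is malformed.
        info = default_info
    return {"status": "ok", "content": csv_blocks[-1], "info": info}


def csv_rows(content: str) -> list[dict]:
    """Parse the extracted csv text into one dict per data row."""
    return list(csv.DictReader(io.StringIO(content)))


# Invented reply for illustration only; a real model reply may differ.
SAMPLE_REPLY = "\n".join([
    FENCE + "csv",
    "Rank,NOC,Gold,Silver,Bronze,Total",
    "1,South Korea,5,1,1,7",
    "2,France,0,1,1,2",
    FENCE,
    "",
    FENCE + "json",
    '{"mode": "data cleaning", "reason": "removed annotation and totals row"}',
    FENCE,
])

if __name__ == "__main__":
    result = parse_reply(SAMPLE_REPLY)
    print(result["status"], result["info"]["mode"])
    for row in csv_rows(result["content"]):
        print(row)
```

Taking the last matching block, as the agent does with `code_blocks[-1]`, means that if the model repeats itself, the final version of the table is the one that is kept.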