Merged
1 change: 1 addition & 0 deletions doc/api/index.rst
@@ -294,6 +294,7 @@ conversion of Python variables to GMT virtual files:
    clib.Session.virtualfile_from_grid
    clib.Session.virtualfile_in
    clib.Session.virtualfile_out
    clib.Session.return_table

Low level access (these are mostly used by the :mod:`pygmt.clib` package):

113 changes: 113 additions & 0 deletions pygmt/clib/session.py
@@ -1738,6 +1738,119 @@ def read_virtualfile(
        dtype = {"dataset": _GMT_DATASET, "grid": _GMT_GRID}[kind]
        return ctp.cast(pointer, ctp.POINTER(dtype))

    def return_table(
Member Author

The method name return_table was initially proposed in #1318 (comment), but is it a good name? At line 1845, we use kind="dataset", so maybe rename it to return_dataset or output_dataset? In the future, I think we will add more methods that return a grid/image/cpt/cube, so consistent method names are preferred.

Member Author

I think return_dataset is better than return_table. Renamed in ce029b2.

Member

@weiji14 weiji14 Mar 10, 2024

How about virtualfile_to_dataset, or vfile_to_dataset (shorter)? A bit more explicit to say that a conversion is happening from a virtualfile to a tabular dataset.

Member Author

virtualfile_to_dataset/vfile_to_dataset sounds like a good name. I like vfile_to_dataset, but do we also want to rename other functions for consistency?

Member

You mean rename the virtualfile_from_* methods? Let's maybe not do that (lazy to deprecate more names). We can go with virtualfile_to_dataset for consistency.

Member Author

I meant to rename the virtualfile_in/virtualfile_out to vfile_in/vfile_out.

Member

@weiji14 weiji14 Mar 11, 2024

Let's stick with the long name (virtualfile_to_dataset). Renaming virtualfile_out -> vfile_out and virtualfile_in -> vfile_in would be more work and confusing (we'd need to manually update the changelog to track that virtualfile_from_data became virtualfile_in, which then became vfile_in).
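
For context on the cost of a rename: each renamed public method would typically need a warning alias kept around for a few releases, roughly like the sketch below. This is a toy class under stated assumptions, not `pygmt.clib.Session`; the `FutureWarning` category and the message text are illustrative and not taken from PyGMT's actual deprecation code.

```python
import warnings


class Session:
    """Toy class (hypothetical); it only illustrates name aliasing."""

    def virtualfile_in(self, data):
        # Current name: pretend to wrap the data in a virtual file.
        return f"vfile({data})"

    def virtualfile_from_data(self, data):
        # Old name kept as a thin alias that warns before delegating.
        warnings.warn(
            "virtualfile_from_data is deprecated; use virtualfile_in instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.virtualfile_in(data)


# Calling the old name still works, but a warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = Session().virtualfile_from_data("xyz")
```

A second rename (to `vfile_in`) would add another alias of the same shape on top of this one, which is the chain the changelog would have to track.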

Member Author

Done in 711142c.

        self,
        output_type: Literal["pandas", "numpy", "file"],
Member

Should we set a default output type here? It looks like we're using pandas as the default in #3092.

Suggested change
- output_type: Literal["pandas", "numpy", "file"],
+ output_type: Literal["pandas", "numpy", "file"] = "pandas",

Member Author

It makes no difference because we always call the function with the output_type parameter, e.g.:

        return lib.return_dataset(
            output_type=output_type,
            vfile=vouttbl,
        )

Member

Yes, it doesn't make any difference in the PyGMT modules, but this is a good central location to document that output_type="pandas" is the default output (though in #1318, it seemed like most of us were in favour of output_type="input" or auto as the default).
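
To make that concrete, here is a minimal sketch of what a documented default could look like. It is not the PR's code: `table_from_frame` is a hypothetical stand-in that takes an already loaded `pandas.DataFrame` instead of a virtual file name, so it runs without GMT.

```python
from __future__ import annotations

from typing import Literal

import numpy as np
import pandas as pd


def table_from_frame(
    frame: pd.DataFrame,
    output_type: Literal["pandas", "numpy", "file"] = "pandas",
    column_names: list[str] | None = None,
) -> pd.DataFrame | np.ndarray | None:
    """Hypothetical stand-in for the method under review (no GMT involved)."""
    if output_type == "file":  # data is assumed to be on disk already
        return None
    if output_type == "numpy":
        return frame.to_numpy()
    result = frame.copy()  # leave the caller's frame untouched
    if column_names is not None:
        result.columns = column_names
    return result


frame = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
default_out = table_from_frame(frame)  # no output_type given, so "pandas" applies
numpy_out = table_from_frame(frame, output_type="numpy")
```

Callers that always pass `output_type` explicitly behave the same either way; the default only matters as documentation and for direct use of the method.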

Member Author

output_type="input" or auto may not make sense for PyGMT, especially in cases like:

  1. the input data is a file, so auto would mean writing to a file by default, which makes outfile required.
  2. the input data is vectors (e.g., x/y/z) and each vector can be a list/ndarray/pd.Series. Then what's the expected format if auto/input is used?
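
A small sketch of the second case, assuming a hypothetical `infer_output_type` helper (not part of PyGMT) that tries to derive the output format from the input container types:

```python
import numpy as np
import pandas as pd


def infer_output_type(*vectors) -> str:
    """Hypothetical helper: guess an output type from the input containers."""
    kinds = {type(v).__name__ for v in vectors}
    if len(kinds) != 1:
        # Mixed containers give no single obvious answer.
        raise ValueError(f"ambiguous input types: {sorted(kinds)}")
    return kinds.pop()


x = [1.0, 2.0]             # plain list
y = np.array([3.0, 4.0])   # numpy.ndarray
z = pd.Series([5.0, 6.0])  # pandas.Series

same = infer_output_type(y, y)  # all inputs agree on one container type
try:
    infer_output_type(x, y, z)
    ambiguous = False
except ValueError:
    ambiguous = True
```

With mixed x/y/z inputs the helper has nothing sensible to return, which is the ambiguity described above.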

Member

@weiji14 weiji14 Mar 11, 2024

Yeah, not saying that output_type="auto" would be easy to implement 🙂 I think the default output_type="pandas" is fine for now since it is an in-memory format that can be converted to virtualfiles relatively easily. We can discuss more about what the ideal output type would be in #1318 (if there is still any debate that needs to be had).

        vfile: str,
        column_names: list[str] | None = None,
    ) -> pd.DataFrame | np.ndarray | None:
        """
        Return an output table from a virtual file based on the output type.

        Parameters
        ----------
        output_type
            Desired output type of the result data.

            - ``"pandas"`` will return a :class:`pandas.DataFrame` object.
            - ``"numpy"`` will return a :class:`numpy.ndarray` object.
            - ``"file"`` means the result was saved to a file and will return ``None``.
        vfile
            The virtual file name that stores the result data. Required for ``"pandas"``
            and ``"numpy"`` output type.
        column_names
            The column names for the :class:`pandas.DataFrame` output.

        Returns
        -------
        table
            The output table. If ``output_type="file"`` returns ``None``.

        Examples
        --------
        >>> from pathlib import Path
        >>> import numpy as np
        >>> import pandas as pd
        >>>
        >>> from pygmt.helpers import GMTTempFile
        >>> from pygmt.clib import Session
        >>>
        >>> with GMTTempFile(suffix=".txt") as tmpfile:
        ...     # prepare the sample data file
        ...     with open(tmpfile.name, mode="w") as fp:
        ...         print(">", file=fp)
        ...         print("1.0 2.0 3.0 TEXT1 TEXT23", file=fp)
        ...         print("4.0 5.0 6.0 TEXT4 TEXT567", file=fp)
        ...         print(">", file=fp)
        ...         print("7.0 8.0 9.0 TEXT8 TEXT90", file=fp)
        ...         print("10.0 11.0 12.0 TEXT123 TEXT456789", file=fp)
        ...
        ...     # file output
        ...     with Session() as lib:
        ...         with GMTTempFile(suffix=".txt") as outtmp:
        ...             with lib.virtualfile_out(
        ...                 kind="dataset", fname=outtmp.name
        ...             ) as vouttbl:
        ...                 lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...                 result = lib.return_table(output_type="file", vfile=vouttbl)
        ...                 assert result is None
        ...             assert Path(outtmp.name).stat().st_size > 0
        ...
        ...     # numpy output
        ...     with Session() as lib:
        ...         with lib.virtualfile_out(kind="dataset") as vouttbl:
        ...             lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...             outnp = lib.return_table(output_type="numpy", vfile=vouttbl)
        ...     assert isinstance(outnp, np.ndarray)
        ...
        ...     # pandas output
        ...     with Session() as lib:
        ...         with lib.virtualfile_out(kind="dataset") as vouttbl:
        ...             lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...             outpd = lib.return_table(output_type="pandas", vfile=vouttbl)
        ...     assert isinstance(outpd, pd.DataFrame)
        ...
        ...     # pandas output with specified column names
        ...     with Session() as lib:
        ...         with lib.virtualfile_out(kind="dataset") as vouttbl:
        ...             lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...             outpd2 = lib.return_table(
        ...                 output_type="pandas",
        ...                 vfile=vouttbl,
        ...                 column_names=["col1", "col2", "col3", "coltext"],
        ...             )
        ...     assert isinstance(outpd2, pd.DataFrame)
        >>> outnp
        array([[1.0, 2.0, 3.0, 'TEXT1 TEXT23'],
               [4.0, 5.0, 6.0, 'TEXT4 TEXT567'],
               [7.0, 8.0, 9.0, 'TEXT8 TEXT90'],
               [10.0, 11.0, 12.0, 'TEXT123 TEXT456789']], dtype=object)
        >>> outpd
              0     1     2                   3
        0   1.0   2.0   3.0        TEXT1 TEXT23
        1   4.0   5.0   6.0       TEXT4 TEXT567
        2   7.0   8.0   9.0        TEXT8 TEXT90
        3  10.0  11.0  12.0  TEXT123 TEXT456789
        >>> outpd2
           col1  col2  col3             coltext
        0   1.0   2.0   3.0        TEXT1 TEXT23
        1   4.0   5.0   6.0       TEXT4 TEXT567
        2   7.0   8.0   9.0        TEXT8 TEXT90
        3  10.0  11.0  12.0  TEXT123 TEXT456789
        """
        if output_type == "file":  # Already written to file, so return None
            return None

        # Read the virtual file as a GMT dataset and convert to pandas.DataFrame
        result = self.read_virtualfile(vfile, kind="dataset").contents.to_dataframe()
        if output_type == "numpy":  # numpy.ndarray output
            return result.to_numpy()

        # Assign column names
        if column_names is not None:
            result.columns = column_names
        return result  # pandas.DataFrame output

    def extract_region(self):
        """
        Extract the WESN bounding box of the currently active figure.