@carlossanlop
Contributor

@carlossanlop carlossanlop commented May 28, 2022

Fixes: #69544
Fixes: #69935

Here is a full description of my investigation on the behavior of extended attributes in the tar tool: #69933 (comment)

@ghost

ghost commented May 28, 2022

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost

ghost commented May 28, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Not ready yet, do not review.

Author: carlossanlop
Assignees: carlossanlop
Labels:

area-System.IO

Milestone: -

@carlossanlop carlossanlop changed the title [Draft] Some tar tests [Draft] Allow multiple Global Extended Attributes entries in Tar archives with PAX format May 28, 2022
@carlossanlop carlossanlop added this to the 7.0.0 milestone May 28, 2022
@carlossanlop carlossanlop added the blocked Issue/PR is blocked on something - see comments label May 28, 2022
@carlossanlop carlossanlop changed the title [Draft] Allow multiple Global Extended Attributes entries in Tar archives with PAX format [Draft] Move Format property to TarEntry and allow multiple Global Extended Attributes entries Jun 1, 2022
@carlossanlop carlossanlop removed the blocked Issue/PR is blocked on something - see comments label Jun 1, 2022
@carlossanlop carlossanlop marked this pull request as ready for review June 1, 2022 05:05
@carlossanlop carlossanlop changed the title [Draft] Move Format property to TarEntry and allow multiple Global Extended Attributes entries Move Format property to TarEntry and allow multiple Global Extended Attributes entries Jun 1, 2022
@carlossanlop carlossanlop requested a review from bartonjs June 1, 2022 19:43
Contributor Author

This is the only thing left to address. Depending on the name attribute in the global extended attributes, I need to apply it to the file, if applicable.

Contributor Author

See my detailed answer here: #69933 (comment)

@carlossanlop carlossanlop marked this pull request as draft June 3, 2022 04:01
@carlossanlop
Contributor Author

carlossanlop commented Jun 3, 2022

I read several sources that describe how the gnu tar tool and the pax tool handle extended attributes.

I was able to generate a GEA that contains global values for the header fields mode, gname, uname, gid, uid, mtime, atime and ctime. These are fields that would make sense to override on all the subsequent entries.

*Note: Other header fields like name, linkname, checksum, size, typeflag do not make sense to override, since they need to be unique to each entry.

What I have not been able to reproduce is extracting an archive that contains a GEA entry overriding such values: I can extract the files, but the extracted files do not acquire the overridden metadata.

Conclusion:

  • I will remove all logic that looks for reserved attribute names (mode, gname, uname, gid, uid, mtime, atime, ctime) in the extended attributes and then applies them to the header fields. I don't see the tools doing this.
  • I will keep the logic that collects long fields and saves them in the extended attributes for all entries. This always happens when creating archives, even if the values don't get propagated to the extracted files. In fact, I have it documented this way:
The following entries are always found in the Extended Attributes dictionary of any PAX entry:

- Modification time, under the name mtime, as a double number.
- Access time, under the name atime, as a double number.
- Change time, under the name ctime, as a double number.
- Path, under the name `path`, as a string.

The following entries are only found in the Extended Attributes dictionary of a PAX entry if certain conditions are met:

- Group name, under the name gname, as a string, if it is larger than 32 bytes.
- User name, under the name uname, as a string, if it is larger than 32 bytes.
- File length, under the name size, as an integer, if the string representation of the number is larger than 12 bytes.
  • The user will have full freedom to add/remove extended attributes and global extended attributes and do what they please with them, but only when manually iterating through the entries and inspecting the attributes dictionary. The TarFile static methods won't do anything about them.
  • If in the future we are notified that we indeed need to apply the metadata fields from extended attributes to the extracted files, we can do it, if we are provided with proof that it is also being done by the tar and the pax tools.
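
As a rough illustration of the documented rules quoted above, here is a minimal Python sketch. It is not the .NET implementation: the function name `collect_pax_attributes` and its parameters are hypothetical, and it assumes the size check is made against the decimal string form of the length and that name lengths are measured in UTF-8 bytes.

```python
def collect_pax_attributes(path, mtime, atime, ctime, uname="", gname="", size=0):
    """Build the extended attributes dictionary for one PAX entry (illustrative only)."""
    # Always present: the three timestamps (as floating-point numbers) and the path.
    attrs = {
        "mtime": float(mtime),
        "atime": float(atime),
        "ctime": float(ctime),
        "path": path,
    }
    # Only present when the value does not fit in the fixed-size header field.
    if len(gname.encode("utf-8")) > 32:
        attrs["gname"] = gname
    if len(uname.encode("utf-8")) > 32:
        attrs["uname"] = uname
    # Assumption: "string representation" means the decimal digits of the length.
    if len(str(size)) > 12:
        attrs["size"] = size
    return attrs


short = collect_pax_attributes("sourcedir/file.txt", 1654291413.35, 1654291413.35,
                               1654291413.35, uname="dotnet", gname="devdiv", size=12)
print(sorted(short))
```

With short user and group names and a small file, only the four always-present keys end up in the dictionary.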

Details

So I think we should avoid overriding any header fields with values found in the extended attributes.

Here is how I tested it in Ubuntu:

carlos@ubuntuwsl:~$ pwd
/home/carlos

carlos@ubuntuwsl:~$ mkdir testtar

carlos@ubuntuwsl:~$ cd testtar

carlos@ubuntuwsl:~/testtar$ mkdir sourcedir

carlos@ubuntuwsl:~/testtar$ echo "Hello world" > sourcedir/file.txt

# Need to see the uid of a user I created called 'dotnet'
carlos@ubuntuwsl:~/testtar$ cat /etc/passwd
...
root:x:0:0:root:/root:/bin/bash
carlos:x:1000:1000:,,,:/home/carlos:/bin/bash
dotnet:x:7913:3580::/home/dotnet:/bin/sh
...

# Need to see the gid of a group I created called 'devdiv'
carlos@ubuntuwsl:~/testtar$ cat /etc/group
...
root:x:0:
carlos:x:1000:
devdiv:x:3579:dotnet
...

Here's how to create a PAX archive with a GEA entry at the beginning:

carlos@ubuntuwsl:~/testtar$ tar cvf archive.tar sourcedir/ --format=posix --pax-option=uname=dotnet,gname=devdiv,uid=7913,gid=3579,mode=0000777,'atime={now}','mtime={now}','ctime={now}'
tar: Option --pax-option: Treating date 'now' as 2022-06-03 14:23:33.3516828
tar: Option --pax-option: Treating date 'now' as 2022-06-03 14:23:33.3516599
tar: Option --pax-option: Treating date 'now' as 2022-06-03 14:23:33.351568
sourcedir/
sourcedir/file.txt
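
For comparison, a similar archive layout can be sketched with Python's standard `tarfile` module instead of gnu tar. This is only an illustration under that assumption: it passes two of the keywords from the command above as `pax_headers` when opening the archive for writing, which `tarfile` emits as a single global extended header, and then checks the type flag of the first header block (byte 156, where `g` marks a global extended header).

```python
import io
import tarfile

# Global values, mirroring two of the keywords passed to --pax-option above.
global_attrs = {"uname": "dotnet", "gname": "devdiv"}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                  pax_headers=global_attrs) as tf:
    data = b"Hello world\n"
    info = tarfile.TarInfo("sourcedir/file.txt")
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

raw = buf.getvalue()
# The type flag lives at offset 156 of each 512-byte header block.
first_typeflag = raw[156:157]

# Reading the archive back exposes the global attributes on the TarFile object.
with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as tf:
    names = tf.getnames()
    global_headers = dict(tf.pax_headers)

print(first_typeflag, names, global_headers)
```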

This is how the archive looks in a hex editor (image not included).

Extract using the gnu tar tool:

carlos@ubuntuwsl:~/testtar$ tar xvf archive.tar --directory destinationdir/
sourcedir/
sourcedir/file.txt

# %x=atime (last access), %y=mtime (last data modification), %z=ctime (last status change)
# Notice the only value that shows exactly like in the output above is mtime (2022-06-03 14:23:33.3516599) but that's because the same timestamp was saved in the header field value
carlos@ubuntuwsl:~/testtar$ stat -c %n,%x,%y,%z destinationdir/*
destinationdir/sourcedir,2022-06-03 14:24:39.449849400 -0700,2022-06-03 14:23:33.351659900 -0700,2022-06-03 14:24:21.169849400 -0700
 
carlos@ubuntuwsl:~/testtar$ stat -c %n,%x,%y,%z destinationdir/sourcedir/*
destinationdir/sourcedir/file.txt,2022-06-03 14:24:21.169849400 -0700,2022-06-03 14:23:33.351659900 -0700,2022-06-03 14:24:21.169849400 -0700
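
The same timestamp check can be scripted. The following is a minimal Python sketch using `os.stat`; the helper name `stat_times` is hypothetical and the temporary file only stands in for an extracted entry.

```python
import os
import tempfile


def stat_times(path):
    """Return (atime, mtime, ctime) in seconds, like `stat -c %x,%y,%z`."""
    st = os.stat(path)
    return st.st_atime, st.st_mtime, st.st_ctime


# Stand-in for an extracted file: create it, then set atime and mtime explicitly.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Hello world\n")
    sample_path = tmp.name

os.utime(sample_path, (1000000000, 1654291413))  # (atime, mtime)
atime, mtime, ctime = stat_times(sample_path)
os.remove(sample_path)

print(atime, mtime, ctime)
```

Note that `os.utime` can set atime and mtime but not ctime, which the file system updates on its own.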

Extract using the pax tool, same results:

carlos@ubuntuwsl:~/testtar$ rm -r destinationdir/*

carlos@ubuntuwsl:~/testtar$ cd destinationdir/

carlos@ubuntuwsl:~/testtar/destinationdir$ pax -rf ../archive.tar 

carlos@ubuntuwsl:~/testtar$ stat -c %n,%x,%y,%z destinationdir/*
destinationdir/sourcedir,2022-06-03 14:24:39.449849400 -0700,2022-06-03 14:23:33.351659900 -0700,2022-06-03 14:24:21.169849400 -0700
 
carlos@ubuntuwsl:~/testtar$ stat -c %n,%x,%y,%z destinationdir/sourcedir/*
destinationdir/sourcedir/file.txt,2022-06-03 14:24:21.169849400 -0700,2022-06-03 14:23:33.351659900 -0700,2022-06-03 14:24:21.169849400 -0700

Now here is an alternative way to generate a PAX archive without a GEA entry at the beginning, but with attributes that apply to all entries: By using := to assign attributes, they get added to the extended attributes entry that precedes every normal entry, instead of adding a single GEA entry at the beginning.

carlos@ubuntuwsl:~/testtar$ tar cvf archive.tar sourcedir/ --format=posix --pax-option=uname:=dotnet,gname:=devdiv,uid:=7913,gid:=3579,mode:=0000777,'atime:={now}','mtime:={now}','ctime:={now}'
tar: Option --pax-option: Treating date 'now' as 2022-06-03 14:30:36.0733551
tar: Option --pax-option: Treating date 'now' as 2022-06-03 14:30:36.0733517
tar: Option --pax-option: Treating date 'now' as 2022-06-03 14:30:36.0732873
sourcedir/
sourcedir/file.txt
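
The per-entry variant can also be sketched with Python's standard `tarfile` module (again an illustration, not what gnu tar does internally). Here the attributes are set on the entry's own `pax_headers`, so `tarfile` writes an extended header (type flag `x`) immediately before that entry and no global header.

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
    data = b"Hello world\n"
    info = tarfile.TarInfo("sourcedir/file.txt")
    info.size = len(data)
    # Attributes attached to this entry only, not to the archive as a whole.
    info.pax_headers = {"uname": "dotnet", "gname": "devdiv"}
    tf.addfile(info, io.BytesIO(data))

raw = buf.getvalue()
# The first header block is the per-entry extended header.
first_typeflag = raw[156:157]

with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as tf:
    member = tf.getmember("sourcedir/file.txt")
    entry_headers = dict(member.pax_headers)
    archive_headers = dict(tf.pax_headers)

print(first_typeflag, entry_headers, archive_headers)
```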

This is how the archive looks in a hex editor (image not included). Notice there is no GEA, but the two EA entries contain all the custom extended attributes.
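
For readers without a hex editor at hand, the data block of an extended header is a sequence of `LEN key=value` records, each ending in a newline, where LEN is the decimal length of the whole record including the length digits themselves. A minimal Python sketch of a parser follows; the function name `parse_pax_records` and the sample block are made up for illustration.

```python
def parse_pax_records(data):
    """Parse the records of a pax extended header data block into a dict."""
    records = {}
    pos = 0
    # The data block is NUL-padded to a multiple of 512 bytes; stop at the padding.
    while pos < len(data) and data[pos:pos + 1] != b"\0":
        space = data.index(b" ", pos)
        length = int(data[pos:space])          # counts the whole record, digits included
        record = data[space + 1:pos + length]  # b"key=value\n"
        key, _, value = record[:-1].partition(b"=")
        records[key.decode("utf-8")] = value.decode("utf-8")
        pos += length
    return records


# Hypothetical data block holding two records, padded to one 512-byte block.
block = b"16 uname=dotnet\n16 gname=devdiv\n".ljust(512, b"\0")
attrs = parse_pax_records(block)
print(attrs)
```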

Extract using the gnu tar tool:

carlos@ubuntuwsl:~/testtar$ rm archive.tar 

carlos@ubuntuwsl:~/testtar$ rm -r destinationdir/*

carlos@ubuntuwsl:~/testtar$ tar xvf archive.tar --directory destinationdir/
tar: Ignoring unknown extended header keyword 'mode'
sourcedir/
tar: Ignoring unknown extended header keyword 'mode'
sourcedir/file.txt

# Once again, notice the only value that shows exactly like in the output above is mtime (2022-06-03 14:30:36.0733517) but that's because the same timestamp was saved in the header field value
carlos@ubuntuwsl:~/testtar$ stat -c %n,%x,%y,%z destinationdir/*
destinationdir/sourcedir,2022-06-03 14:31:17.229849400 -0700,2022-06-03 14:30:36.073351700 -0700,2022-06-03 14:31:17.229849400 -0700

carlos@ubuntuwsl:~/testtar$ stat -c %n,%x,%y,%z destinationdir/sourcedir/*
destinationdir/sourcedir/file.txt,2022-06-03 14:31:17.229849400 -0700,2022-06-03 14:30:36.073351700 -0700,2022-06-03 14:31:17.229849400 -0700

Extract using the pax tool, same results:

carlos@ubuntuwsl:~/testtar$ rm -r destinationdir/*

carlos@ubuntuwsl:~/testtar$ cd destinationdir/

carlos@ubuntuwsl:~/testtar/destinationdir$ pax -rf ../archive.tar 

carlos@ubuntuwsl:~/testtar$ stat -c %n,%x,%y,%z destinationdir/*
destinationdir/sourcedir,2022-06-03 14:31:17.229849400 -0700,2022-06-03 14:30:36.073351700 -0700,2022-06-03 14:31:17.229849400 -0700

carlos@ubuntuwsl:~/testtar$ stat -c %n,%x,%y,%z destinationdir/sourcedir/*
destinationdir/sourcedir/file.txt,2022-06-03 14:31:17.229849400 -0700,2022-06-03 14:30:36.073351700 -0700,2022-06-03 14:31:17.229849400 -0700

Sources

@carlossanlop
Contributor Author

Will submit this change divided in two PRs for simplicity.

@carlossanlop carlossanlop deleted the PaxGea branch June 21, 2022 02:21
@ghost ghost locked as resolved and limited conversation to collaborators Jul 21, 2022