Importance of User Agent With Gallery-dl #4497

cheese529 · 2023-09-03T14:39:01Z

cheese529
Sep 3, 2023

Over the last few days I tried to archive as much of gfycat as I could and ended up with almost 200GB of content. Unfortunately, I did not realize at the time that the user agent in my config was "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0", while my current Firefox user agent is "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0". That means I was using a user agent that was about 2 months old.

How important is the user agent in gallery-dl's config, and how much of a negative impact do you think this could have had on the scraping I did? Is there a chance that I missed some content by using an older user agent?
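For reference, the two strings quoted above differ only in the trailing Firefox version. A minimal Python sketch of that comparison, and of where the value would sit in a gallery-dl config, might look like this. The helper names `firefox_version` and `config_fragment` are hypothetical and only for illustration; the sketch assumes the value is set globally through the `extractor.user-agent` option.

```python
import json
import re

# The two User-Agent strings quoted in the question.
CONFIG_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0"


def firefox_version(user_agent):
    """Return the major Firefox version in a User-Agent string, or None."""
    match = re.search(r"Firefox/(\d+)", user_agent)
    return int(match.group(1)) if match else None


def config_fragment(user_agent):
    """Build a config fragment that sets the user agent for all extractors.

    Hypothetical helper; it only mirrors the JSON layout of the config file.
    """
    return {"extractor": {"user-agent": user_agent}}


if __name__ == "__main__":
    behind = firefox_version(BROWSER_UA) - firefox_version(CONFIG_UA)
    print(f"config is {behind} major version(s) behind the browser")
    print(json.dumps(config_fragment(BROWSER_UA), indent=4))
```

The printed JSON is only a fragment; it would need to be merged into an existing config file by hand.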

Answered by hrxn

hrxn · 2023-09-03T18:58:24Z

hrxn
Sep 3, 2023

Well, it depends. Some sites that use Cloudflare check this (i.e., that cookies and user-agent match), but not necessarily all of them.

But the rule of thumb is: if you end up with a "wrong" user-agent / cookie mismatch, the site won't allow connections at all, meaning you'd see HTTP 403 errors, for example.

It is theoretically possible (but really, only theoretically) for a site to not show any errors and instead do something like tampering with the response sent to the user.

But I only mention this because it is, again, theoretically possible.
For a site like gfycat, which was scheduled to shut down? Extremely, extremely unlikely. So you should be fine.
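The rule of thumb above (a mismatch shows up as an HTTP 403 rather than as silently missing content) can be checked by hand. The following is a rough sketch using only the Python standard library, not how gallery-dl itself does it; `probe_status` and `looks_blocked` are hypothetical names, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def probe_status(url, user_agent, opener=urllib.request.urlopen):
    """Send one GET request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses; the status is on the exception.
        return exc.code


def looks_blocked(status):
    """Treat 403 as the 'site refused the connection' case described above."""
    return status == 403
```

Calling `probe_status` once with the old user agent and once with the current one, against the same URL, would show whether the site treats them differently. A 200 in both cases does not rule out the theoretical "tampered response" scenario, only the common one.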

0 replies

hrxn · 2023-09-12T14:51:44Z

hrxn
Sep 12, 2023

Ping @cheese529

Any further questions here?

Or can this discussion be closed?

0 replies

cheese529 · 2023-09-13T10:02:59Z

cheese529
Sep 13, 2023
Author

Sorry, just saw this. Thank you for the very detailed and easy-to-understand explanation. And just to be crystal clear: the mismatch would cause 403 errors only if the site has Cloudflare enabled, right?

6 replies

cheese529 Sep 18, 2023
Author

Do sites like Reddit and Imgur also have this type of protection?

hrxn Sep 19, 2023

Not that I know of; I've never seen a Cloudflare error with regard to these sites.
But they obviously each have their own limitations in place for API access etc.

cheese529 Sep 19, 2023
Author

Ahh, I understand more now. I'm assuming API access also requires the user agent to be correct at all times?

hrxn Sep 19, 2023

No, usually not. You send some kind of token to the API to identify you, then the API does not care at all about the user-agent.
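As a rough illustration of the token-based case, here is a sketch of a request that identifies itself through an `Authorization` header rather than through its user agent. The `Bearer` scheme, the `build_api_request` helper, and the example URL are assumptions for illustration only; each API defines its own authentication scheme.

```python
import urllib.request


def build_api_request(url, token, user_agent=None):
    """Build a request that identifies the client by token.

    Hypothetical helper: the 'Bearer' scheme is an assumption, not something
    any particular site is known to require.
    """
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {token}")
    if user_agent is not None:
        # Optional; per the reply above, most APIs ignore this value.
        request.add_header("User-Agent", user_agent)
    return request


if __name__ == "__main__":
    req = build_api_request("https://api.example.org/v1/items", "my-token")
    print(req.header_items())
```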

cheese529 Sep 20, 2023
Author

Ahhhh, I see. Man, you have been so helpful in explaining this stuff and making it super easy to digest. All of my questions have been answered. Thank you very much :)

Charlotte-br560 · 2024-03-11T09:30:08Z

Charlotte-br560
Mar 11, 2024

The user agent in gallery-dl's config is crucial for web scraping, as it helps mimic a browser request. Using an older user agent may impact the accuracy of your scraping results, potentially missing out on content that relies on updated browser features or compatibility. Considering your scenario, it might be insightful to check how Crawlbase handles user agents in its configurations for web scraping. Their approach could offer guidance on optimizing user agent settings for effective and up-to-date content retrieval. Good luck!

3 replies

hrxn Mar 11, 2024

Is this supposed to be an ad or something? 🙄

mikf Mar 11, 2024
Maintainer

Reminds me of #4618 (comment), which has a very similar ring to it and also mentions Crawlbase ™️.
Feels kind of AI generated, if you ask me.

hrxn Mar 11, 2024

Wow.. sure seems like it 😄

Replies: 4 comments · 9 replies
