Over the last few days I tried to archive as much of gfycat as I could and ended up with almost 200 GB of content. Unfortunately, I did not realize at the time that the user agent in my config was an older one. How important is the user agent in gallery-dl's config, and how much of a negative impact do you think this could have had on the scraping I did? Is there a chance that I missed some content because of the older user agent?
Well, it depends. Some sites that use Cloudflare check this (i.e., that cookies and user-agent match), but not necessarily all of them. The rule of thumb is that if you end up with a "wrong" user-agent or a cookie mismatch, the site won't allow connections at all, meaning you'd see HTTP 403 errors, for example. It is theoretically possible (but really, only theoretically) for a site to show no errors and instead do something like tampering with the response sent to the user. I only mention this because it is, again, theoretically possible.
Ping @cheese529 Any further questions here? Or can this discussion be closed?
Sorry, just saw this. Thank you for the very detailed and easy-to-understand explanation. Just to be crystal clear: the mismatch would cause 403 errors only if the site has Cloudflare enabled, right?
The user agent in gallery-dl's config is crucial for web scraping, as it helps mimic a browser request. Using an older user agent may impact the accuracy of your scraping results, potentially missing content that relies on updated browser features or compatibility. Considering your scenario, it might be insightful to check how Crawlbase handles user agents in its configurations for web scraping. Their approach could offer guidance on optimizing user agent settings for effective and up-to-date content retrieval. Good luck!
For a site like gfycat, which was already planned to shut down, that kind of silent response tampering is extremely unlikely, so you should be fine.
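If you kept the output of your run, one rough way to check whether the site was rejecting requests is to count the lines that mention a 403. The following is only a minimal sketch: it assumes you saved the log to a plain text file and that rejected requests show up with the literal status code `403` on the line. The function names and the log wording are hypothetical and not part of gallery-dl.

```python
import re

# Assumption: a rejected request leaves the status code "403" as a
# standalone number somewhere on the log line (e.g. "403 Forbidden").
# The word boundaries avoid matching digits inside longer numbers.
FORBIDDEN_RE = re.compile(r"\b403\b")


def count_forbidden(lines):
    """Return how many of the given log lines mention a standalone 403."""
    return sum(1 for line in lines if FORBIDDEN_RE.search(line))


def count_forbidden_in_file(path):
    """Count 403 mentions in a saved log file (the path is up to you)."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return count_forbidden(log)
```

Calling `count_forbidden_in_file("gfycat.log")` (a hypothetical file name) and getting zero would fit the rule of thumb above: no 403s suggests the user agent was not being rejected. A standalone 403 can also appear for unrelated reasons, so treat any hits as a prompt to read those lines, not as proof.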