feat: Add basic OpenTelemetry instrumentation
#1255
Merged
15 commits
47474ca
Draft with basic instrumentor and example
Pijukatel c6b5ab4
Implement uninstrument.
Pijukatel c7ea48e
Finalized instrumentor and example.
Pijukatel 0ef03a4
Example finished and moved to it's proper place.
Pijukatel 0902394
Add test. Polish docs.
Pijukatel e5a0c29
Merge remote-tracking branch 'origin/master' into otel-test
Pijukatel 9feb417
Merge remote-tracking branch 'origin/master' into otel-test
Pijukatel 45eecea
Apply suggestions from code review
Pijukatel 292b937
Review comments
Pijukatel 9c281e4
Merge remote-tracking branch 'origin/master' into otel-test
Pijukatel 322f6f6
Update open_telemetry.mdx
Pijukatel 799740a
Update open_telemetry.mdx
Pijukatel f6c1c7f
Apply suggestions from code review
Pijukatel d0a2ee6
Update name and title
Pijukatel 5eaea80
Update example folder name and link to it
Pijukatel
57 changes: 57 additions & 0 deletions
docs/guides/code_examples/open_telemetry/instrument_crawler.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| import asyncio | ||
|
|
||
| from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter | ||
| from opentelemetry.sdk.resources import Resource | ||
| from opentelemetry.sdk.trace import TracerProvider | ||
| from opentelemetry.sdk.trace.export import SimpleSpanProcessor | ||
| from opentelemetry.trace import set_tracer_provider | ||
|
|
||
| from crawlee.crawlers import BasicCrawlingContext, ParselCrawler, ParselCrawlingContext | ||
| from crawlee.otel import CrawlerInstrumentor | ||
| from crawlee.storages import Dataset, KeyValueStore, RequestQueue | ||
|
|
||
|
|
||
| def instrument_crawler() -> None: | ||
| """Add instrumentation to the crawler.""" | ||
| resource = Resource.create( | ||
| { | ||
| 'service.name': 'ExampleCrawler', | ||
| 'service.version': '1.0.0', | ||
| 'environment': 'development', | ||
| } | ||
| ) | ||
|
|
||
| # Set up the OpenTelemetry tracer provider and exporter | ||
| provider = TracerProvider(resource=resource) | ||
| otlp_exporter = OTLPSpanExporter(endpoint='localhost:4317', insecure=True) | ||
| provider.add_span_processor(SimpleSpanProcessor(otlp_exporter)) | ||
| set_tracer_provider(provider) | ||
| # Instrument the crawler with OpenTelemetry | ||
| CrawlerInstrumentor( | ||
| instrument_classes=[RequestQueue, KeyValueStore, Dataset] | ||
| ).instrument() | ||
|
|
||
|
|
||
| async def main() -> None: | ||
| """Run the crawler.""" | ||
| instrument_crawler() | ||
|
|
||
| crawler = ParselCrawler(max_requests_per_crawl=100) | ||
| kvs = await KeyValueStore.open() | ||
|
|
||
| @crawler.pre_navigation_hook | ||
| async def pre_nav_hook(_: BasicCrawlingContext) -> None: | ||
| # Simulate some pre-navigation processing | ||
| await asyncio.sleep(0.01) | ||
|
|
||
| @crawler.router.default_handler | ||
| async def handler(context: ParselCrawlingContext) -> None: | ||
| await context.push_data({'url': context.request.url}) | ||
| await kvs.set_value(key='url', value=context.request.url) | ||
| await context.enqueue_links() | ||
|
|
||
| await crawler.run(['https://crawlee.dev/']) | ||
|
|
||
|
|
||
| if __name__ == '__main__': | ||
| asyncio.run(main()) |
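
The example above exports spans to `localhost:4317`, so it only produces visible traces when something is listening on that port. A quick reachability check before starting the crawler may save a confusing run. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper and is not part of Crawlee or OpenTelemetry.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established.

    Hypothetical helper for illustration only. It checks that *something*
    accepts connections on the port, not that it is an OTLP collector.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == '__main__':
    # 4317 is the OTLP gRPC endpoint used by the example script.
    if not is_port_open('localhost', 4317):
        print('Nothing is listening on localhost:4317; spans may not be exported.')
```

The check is deliberately coarse: it tells you whether the port accepts TCP connections, which is usually enough to catch a container that was never started.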
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| --- | ||
| id: otel | ||
|
||
| title: Trace and optimize crawlers | ||
| description: How to instrument crawlers with OpenTelemetry | ||
|
||
| --- | ||
|
|
||
| import ApiLink from '@site/src/components/ApiLink'; | ||
| import CodeBlock from '@theme/CodeBlock'; | ||
|
|
||
| import InstrumentCrawler from '!!raw-loader!./code_examples/open_telemetry/instrument_crawler.py'; | ||
|
|
||
| [OpenTelemetry](https://opentelemetry.io/) is a collection of APIs, SDKs, and tools to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior. In the context of crawler development, it can help you understand how the crawler works internally, identify bottlenecks, debug issues, log metrics, and more. This guide assumes at least a basic understanding of OpenTelemetry. A good place to start is [What is OpenTelemetry](https://opentelemetry.io/docs/what-is-opentelemetry/). | ||
|
|
||
| This guide shows how to set up OpenTelemetry and instrument a specific crawler to see traces of the individual requests it processes. OpenTelemetry on its own does not provide an out-of-the-box tool for convenient visualization of the exported data (apart from printing to the console), but several good tools are available for that. This guide uses [Jaeger](https://www.jaegertracing.io/) to visualize the telemetry data. To better understand concepts such as exporter, collector, and visualization backend, please refer to the [OpenTelemetry documentation](https://opentelemetry.io/docs/collector/). | ||
|
|
||
|
|
||
| ## Set up Jaeger | ||
|
||
|
|
||
| This section shows how to set up a local environment to run the example code and visualize the telemetry data in Jaeger, which runs locally in a [Docker](https://www.docker.com/) container. | ||
|
|
||
| To start the preconfigured Docker container, you can use the following command: | ||
|
|
||
| ```bash | ||
| docker run -d --name jaeger -e COLLECTOR_OTLP_ENABLED=true -p 16686:16686 -p 4317:4317 -p 4318:4318 jaegertracing/all-in-one:latest | ||
| ``` | ||
| For more details about the Jaeger setup, see the [getting started](https://www.jaegertracing.io/docs/2.7/getting-started/) section in their documentation. | ||
| You can see the Jaeger UI in your browser by navigating to http://localhost:16686 | ||
|
|
||
| ## Instrument the Crawler | ||
|
|
||
| Now you can instrument the crawler to send telemetry data to Jaeger and run it. To prepare the Python environment, install either **crawlee[all]** or **crawlee[otel]**. This ensures that the OpenTelemetry dependencies are installed and that you can run the example code snippet. | ||
| In the following example, the function `instrument_crawler` contains the instrumentation setup and is called before the crawler is started. If you have already set up Jaeger, you can run the following code snippet directly. | ||
|
|
||
|
||
| <CodeBlock className="language-python"> | ||
| {InstrumentCrawler} | ||
| </CodeBlock> | ||
|
||
|
|
||
| ## Analyze the results | ||
|
||
|
|
||
| In the Jaeger UI, you can search for traces, apply filters, compare traces, and inspect their attributes and timing details. For a detailed description of the tool's capabilities, please refer to the [Jaeger documentation](https://www.jaegertracing.io/docs/1.47/deployment/frontend-ui/#trace-page). | ||
|
|
||
|  | ||
|  | ||
|
|
||
| You can use other tools to consume the OpenTelemetry data if they suit your needs better. Please see the list of known vendors in the [OpenTelemetry documentation](https://opentelemetry.io/ecosystem/vendors/). | ||
|
|
||
| ## Customize the instrumentation | ||
|
|
||
| You can customize the <ApiLink to="class/CrawlerInstrumentor">`CrawlerInstrumentor`</ApiLink>. Depending on the arguments used during its initialization, the instrumentation is applied to different parts of the Crawlee code. By default, it instruments a handpicked set of functions that give a good picture of how each individual request is handled. To turn this default instrumentation off, pass `request_handling_instrumentation=False` during initialization. You can also extend the instrumentation by passing the `instrument_classes=[...]` initialization argument with the classes you want to be auto-instrumented; all their public methods will be instrumented automatically. Bear in mind that instrumentation has a runtime cost: the more instrumentation you use, the more overhead it adds to the crawler execution. | ||
|
|
||
|
|
||
| You can also create your own instrumentation by selecting only the methods you want to instrument. For more details, see the <ApiLink to="class/CrawlerInstrumentor">`CrawlerInstrumentor`</ApiLink> source code and the [Python documentation for OpenTelemetry](https://opentelemetry.io/docs/languages/python/). | ||
|
||
|
|
||
| If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). | ||
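
The guide mentions creating your own instrumentation by selecting only the methods you want to instrument. The sketch below illustrates that idea with the standard library only, so it runs without Crawlee or OpenTelemetry installed. It is a stand-in, not the `CrawlerInstrumentor` implementation: `SelectiveInstrumentor`, `_timed`, `recorded`, and `Store` are hypothetical names, spans are replaced by a plain list of `(name, duration)` records, and patching is done with `setattr` rather than `wrapt`.

```python
import asyncio
import functools
import inspect
import time
from typing import Any, Callable

# Stand-in for exported spans: (method name, duration in seconds).
recorded: list[tuple[str, float]] = []


def _timed(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap `func` so that every call appends a timing record."""
    if inspect.iscoroutinefunction(func):

        @functools.wraps(func)
        async def async_wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                recorded.append((func.__name__, time.perf_counter() - start))

        return async_wrapper

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            recorded.append((func.__name__, time.perf_counter() - start))

    return wrapper


class SelectiveInstrumentor:
    """Hypothetical helper that wraps only the listed (class, method) pairs.

    Limitation: the methods must be plain functions defined directly on the
    class (not inherited, static, or class methods).
    """

    def __init__(self, targets: list[tuple[type, str]]) -> None:
        self._targets = targets
        self._originals: dict[tuple[type, str], Any] = {}

    def instrument(self) -> None:
        for cls, name in self._targets:
            if (cls, name) in self._originals:
                continue  # Already wrapped; avoid double wrapping.
            original = vars(cls)[name]
            self._originals[(cls, name)] = original
            setattr(cls, name, _timed(original))

    def uninstrument(self) -> None:
        for (cls, name), original in self._originals.items():
            setattr(cls, name, original)
        self._originals.clear()


class Store:
    """Toy class standing in for a storage client."""

    def put(self, item: int) -> None:
        pass

    async def fetch(self) -> None:
        await asyncio.sleep(0)

    def untouched(self) -> None:
        pass
```

Only `put` and `fetch` would be recorded after `SelectiveInstrumentor([(Store, 'put'), (Store, 'fetch')]).instrument()`, while `untouched` keeps running without overhead. This mirrors the trade-off described above: fewer wrapped methods mean less detail but also less runtime cost.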
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| from crawlee.otel.crawler_instrumentor import CrawlerInstrumentor | ||
|
|
||
| __all__ = [ | ||
| 'CrawlerInstrumentor', | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,152 @@ | ||
| from __future__ import annotations | ||
|
|
||
| import inspect | ||
| from typing import TYPE_CHECKING, Any | ||
|
|
||
| from opentelemetry.instrumentation.instrumentor import ( # type:ignore[attr-defined] # Mypy has troubles with OTEL | ||
| BaseInstrumentor, | ||
| ) | ||
| from opentelemetry.instrumentation.utils import unwrap | ||
| from opentelemetry.semconv.attributes.code_attributes import CODE_FUNCTION_NAME | ||
| from opentelemetry.semconv.attributes.http_attributes import HTTP_REQUEST_METHOD | ||
| from opentelemetry.semconv.attributes.url_attributes import URL_FULL | ||
| from opentelemetry.trace import get_tracer | ||
| from wrapt import wrap_function_wrapper | ||
|
|
||
| from crawlee._utils.docs import docs_group | ||
| from crawlee.crawlers import BasicCrawler, ContextPipeline | ||
| from crawlee.crawlers._basic._context_pipeline import _Middleware | ||
|
|
||
| if TYPE_CHECKING: | ||
| from collections.abc import Callable | ||
|
|
||
| from crawlee.crawlers import BasicCrawlingContext | ||
|
|
||
|
|
||
| @docs_group('Classes') | ||
| class CrawlerInstrumentor(BaseInstrumentor): | ||
| """Helper class for instrumenting crawlers with OpenTelemetry.""" | ||
|
|
||
| def __init__( | ||
| self, *, instrument_classes: list[type] | None = None, request_handling_instrumentation: bool = True | ||
| ) -> None: | ||
| """Initialize the instrumentor. | ||
|
|
||
| Args: | ||
| instrument_classes: List of classes to be instrumented - all their public methods and coroutines will be | ||
| wrapped by generic instrumentation wrapper that will create spans for them. | ||
| request_handling_instrumentation: Handpicked most interesting methods to instrument in the request handling | ||
| pipeline. | ||
| """ | ||
| self._tracer = get_tracer(__name__) | ||
|
|
||
| async def _simple_async_wrapper(wrapped: Any, _: Any, args: Any, kwargs: Any) -> Any: | ||
| with self._tracer.start_as_current_span( | ||
| name=wrapped.__name__, attributes={CODE_FUNCTION_NAME: wrapped.__qualname__} | ||
| ): | ||
| return await wrapped(*args, **kwargs) | ||
|
|
||
| def _simple_wrapper(wrapped: Any, _: Any, args: Any, kwargs: Any) -> Any: | ||
| with self._tracer.start_as_current_span( | ||
| name=wrapped.__name__, attributes={CODE_FUNCTION_NAME: wrapped.__qualname__} | ||
| ): | ||
| return wrapped(*args, **kwargs) | ||
|
|
||
| def _init_wrapper(wrapped: Any, _: Any, args: Any, kwargs: Any) -> None: | ||
| with self._tracer.start_as_current_span( | ||
| name=wrapped.__name__, attributes={CODE_FUNCTION_NAME: wrapped.__qualname__} | ||
| ): | ||
| wrapped(*args, **kwargs) | ||
|
|
||
| self._instrumented: list[tuple[Any, str, Callable]] = [] | ||
| self._simple_wrapper = _simple_wrapper | ||
| self._simple_async_wrapper = _simple_async_wrapper | ||
| self._init_wrapper = _init_wrapper | ||
|
|
||
| if instrument_classes: | ||
| for _class in instrument_classes: | ||
| self._instrument_all_public_methods(on_class=_class) | ||
|
|
||
| if request_handling_instrumentation: | ||
|
|
||
| async def middlware_wrapper(wrapped: Any, instance: _Middleware, args: Any, kwargs: Any) -> Any: | ||
| with self._tracer.start_as_current_span( | ||
| name=f'{instance.generator.__name__}, {wrapped.__name__}', # type:ignore[attr-defined] # valid in our context | ||
| attributes={ | ||
| URL_FULL: instance.input_context.request.url, | ||
| CODE_FUNCTION_NAME: instance.generator.__qualname__, # type:ignore[attr-defined] # valid in our context | ||
| }, | ||
| ): | ||
| return await wrapped(*args, **kwargs) | ||
|
|
||
| async def context_pipeline_wrapper( | ||
| wrapped: Any, _: ContextPipeline[BasicCrawlingContext], args: Any, kwargs: Any | ||
| ) -> Any: | ||
| context = args[0] | ||
| final_context_consumer = args[1] | ||
|
|
||
| async def wrapped_final_consumer(*args: Any, **kwargs: Any) -> Any: | ||
| with self._tracer.start_as_current_span( | ||
| name='request_handler', | ||
| attributes={URL_FULL: context.request.url, HTTP_REQUEST_METHOD: context.request.method}, | ||
| ): | ||
| return await final_context_consumer(*args, **kwargs) | ||
|
|
||
| with self._tracer.start_as_current_span( | ||
| name='ContextPipeline', | ||
| attributes={URL_FULL: context.request.url, HTTP_REQUEST_METHOD: context.request.method}, | ||
| ): | ||
| return await wrapped(context, wrapped_final_consumer, **kwargs) | ||
|
|
||
| async def _commit_request_handler_result_wrapper( | ||
| wrapped: Callable[[Any], Any], _: BasicCrawler, args: Any, kwargs: Any | ||
| ) -> Any: | ||
| context = args[0] | ||
| with self._tracer.start_as_current_span( | ||
| name='Commit results', | ||
| attributes={URL_FULL: context.request.url, HTTP_REQUEST_METHOD: context.request.method}, | ||
| ): | ||
| return await wrapped(*args, **kwargs) | ||
|
|
||
| # Handpicked interesting methods to instrument | ||
| self._instrumented.extend( | ||
| [ | ||
| (_Middleware, 'action', middlware_wrapper), | ||
| (_Middleware, 'cleanup', middlware_wrapper), | ||
| (ContextPipeline, '__call__', context_pipeline_wrapper), | ||
| (BasicCrawler, '_BasicCrawler__run_task_function', self._simple_async_wrapper), | ||
| (BasicCrawler, '_commit_request_handler_result', _commit_request_handler_result_wrapper), | ||
| ] | ||
| ) | ||
|
|
||
| def instrumentation_dependencies(self) -> list[str]: | ||
| """Return a list of python packages with versions that will be instrumented.""" | ||
| return ['crawlee'] | ||
|
|
||
| def _instrument_all_public_methods(self, on_class: type) -> None: | ||
| public_coroutines = { | ||
| name | ||
| for name, member in inspect.getmembers(on_class, predicate=inspect.iscoroutinefunction) | ||
| if not name.startswith('_') | ||
| } | ||
| public_methods = { | ||
| name | ||
| for name, member in inspect.getmembers(on_class, predicate=inspect.isfunction) | ||
| if not name.startswith('_') | ||
| } - public_coroutines | ||
|
|
||
| for coroutine in public_coroutines: | ||
| self._instrumented.append((on_class, coroutine, self._simple_async_wrapper)) | ||
|
|
||
| for method in public_methods: | ||
| self._instrumented.append((on_class, method, self._simple_wrapper)) | ||
|
|
||
| self._instrumented.append((on_class, '__init__', self._init_wrapper)) | ||
|
|
||
| def _instrument(self, **_: Any) -> None: | ||
| for _class, method, wrapper in self._instrumented: | ||
| wrap_function_wrapper(_class, method, wrapper) | ||
|
|
||
| def _uninstrument(self, **_: Any) -> None: | ||
| for _class, method, wrapper in self._instrumented: # noqa: B007 | ||
|
||
| unwrap(_class, method) | ||