Overview
This guide provides basic steps for troubleshooting a running topology. They are intended as starting points for investigating potential issues and identifying root causes.
This guide is organized into the following broad sections: determining topology running status and health, identifying topology problems, and frequently seen issues.
This guide is intended for topology developers. Issues related to Heron configuration setup or its internal architecture, such as schedulers, are covered in Configuration and Heron Developers respectively, and are not discussed here.
Determine topology running status and health
1. Estimate your data rate
It is important to estimate how much data a topology is expected to consume.
A useful approach is to begin by estimating the data rate in terms of items per minute. The emit count (tuples per minute) of each spout should match the data rate of the corresponding data stream. If spouts are not consuming and emitting data at the same rate as it is produced, this is called spout lag.
Some spouts, such as the Kafka spout, have a lag metric that can be used directly to measure health. It is recommended to export a similar lag metric from any custom spout, so that it is easy to check and to create monitoring alerts.
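As an illustration, the following is a minimal sketch of a custom spout exporting a lag metric. SourceClient is a hypothetical stand-in for whatever system the spout reads from; package names assume a recent Apache Heron release (older releases use the com.twitter.heron packages), and the exact metrics API may differ in your version.

```java
// Hedged sketch: exporting a lag metric from a custom spout.
import java.util.Map;

import org.apache.heron.api.metric.IMetric;
import org.apache.heron.api.spout.BaseRichSpout;
import org.apache.heron.api.spout.SpoutOutputCollector;
import org.apache.heron.api.topology.OutputFieldsDeclarer;
import org.apache.heron.api.topology.TopologyContext;
import org.apache.heron.api.tuple.Fields;
import org.apache.heron.api.tuple.Values;

public class LagReportingSpout extends BaseRichSpout {
  private static final long serialVersionUID = 1L;

  private transient SourceClient client;   // created in open(), not in the constructor
  private transient SpoutOutputCollector collector;

  @Override
  public void open(Map<String, Object> conf, TopologyContext context,
                   SpoutOutputCollector outputCollector) {
    this.collector = outputCollector;
    this.client = new SourceClient();

    // Report lag (items produced upstream minus items this spout has consumed)
    // once per minute, so dashboards and alerts can watch it.
    context.registerMetric("source-lag",
        (IMetric<Long>) () -> client.latestOffset() - client.consumedOffset(), 60);
  }

  @Override
  public void nextTuple() {
    String item = client.poll();            // must be non-blocking
    if (item != null) {
      collector.emit(new Values(item));
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("item"));
  }

  // Placeholder for the real source client; only here to keep the sketch self-contained.
  private static final class SourceClient {
    long latestOffset() { return 0L; }
    long consumedOffset() { return 0L; }
    String poll() { return null; }
  }
}
```

With such a metric in place, a monitoring alert can fire whenever source-lag stays above a threshold for several minutes.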
2. Absent backpressure
Backpressure initiated by an instance means that the instance is unable to consume data at the rate at which it is receiving it. This causes all spouts to be clamped (they will not consume any more data) until the backpressure is relieved by that instance.
Backpressure is measured in milliseconds per minute: the amount of time an instance was under backpressure during that minute. For example, a value of 60,000 means an instance was under backpressure for the whole minute (60 seconds).
A healthy topology should not have backpressure. Backpressure usually results in spout lag building up, since the spouts get clamped, but it should be treated as a symptom rather than a root cause. Adjust and iterate on the topology until backpressure is absent.
3. Absent failures
Failed tuples are generally considered bad for a topology, unless dropping tuples is an accepted trade-off (for instance, when the lowest possible latency is needed at the expense of occasionally dropped tuples). If acking is disabled, or enabled but not handled properly in the spouts, failures can result in data loss without adding to spout lag.
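If acking is enabled, one common way to avoid silent data loss is to keep failed tuples around and replay them. The following is a hedged sketch of that pattern, assuming replaying an item is acceptable for the use case; the in-memory pending map and the readNextItem stub are illustrative only, and package names assume a recent Apache Heron release.

```java
// Hedged sketch of a spout that handles acks and fails so failed tuples are replayed
// instead of silently dropped.
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import org.apache.heron.api.spout.BaseRichSpout;
import org.apache.heron.api.spout.SpoutOutputCollector;
import org.apache.heron.api.topology.OutputFieldsDeclarer;
import org.apache.heron.api.topology.TopologyContext;
import org.apache.heron.api.tuple.Fields;
import org.apache.heron.api.tuple.Values;

public class ReliableSpout extends BaseRichSpout {
  private static final long serialVersionUID = 1L;

  private transient SpoutOutputCollector collector;
  private transient Map<String, String> pending;   // messageId -> item awaiting ack

  @Override
  public void open(Map<String, Object> conf, TopologyContext context,
                   SpoutOutputCollector outputCollector) {
    this.collector = outputCollector;
    this.pending = new HashMap<>();
  }

  @Override
  public void nextTuple() {
    String item = readNextItem();                  // placeholder for the real source
    if (item != null) {
      String messageId = UUID.randomUUID().toString();
      pending.put(messageId, item);
      // Emitting with a message ID enables acking for this tuple.
      collector.emit(new Values(item), messageId);
    }
  }

  @Override
  public void ack(Object messageId) {
    pending.remove(messageId);                     // fully processed, safe to forget
  }

  @Override
  public void fail(Object messageId) {
    String item = pending.get(messageId);
    if (item != null) {
      collector.emit(new Values(item), messageId); // replay rather than lose the item
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("item"));
  }

  private String readNextItem() {
    return null;                                   // stub so the sketch is self-contained
  }
}
```

In practice the pending map is usually bounded, or replay is delegated to the source itself (as the Kafka spout does with offsets), so that the map cannot grow without limit.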
Identify topology problems
1. Look at instances under backpressure
Backpressure metrics identify which instances have been under backpressure. Jump directly to the logs of that instance to see what is going wrong with it. Some known causes of backpressure are discussed in the frequently seen issues section below.
2. Look at items pending to be acked
Spouts export a metric that is a sampled value of the number of tuples still in flight in the topology. Sometimes the max-spout-pending config limits the consumption rate of the topology. Increasing that spout's parallelism, or raising max-spout-pending itself, generally solves the issue.
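For illustration, here is a hedged sketch of raising both spout parallelism and max-spout-pending when building a topology. It assumes the Storm-style setMaxSpoutPending helper on Heron's Config; the exact helper names may differ across Heron versions.

```java
// Hedged sketch: raising max-spout-pending and spout parallelism when building a
// topology. Spout and bolt instances are passed in so the sketch stays generic.
import org.apache.heron.api.Config;
import org.apache.heron.api.HeronSubmitter;
import org.apache.heron.api.bolt.IRichBolt;
import org.apache.heron.api.spout.IRichSpout;
import org.apache.heron.api.topology.TopologyBuilder;

public final class PendingTuningExample {
  public static void submit(String name, IRichSpout spout, IRichBolt bolt) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("source", spout, 4);                         // more spout instances
    builder.setBolt("worker", bolt, 8).shuffleGrouping("source"); // keep downstream in step

    Config conf = new Config();
    conf.setMaxSpoutPending(5000);   // more tuples allowed in flight per spout instance

    HeronSubmitter.submitTopology(name, conf, builder.createTopology());
  }
}
```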
Frequently seen issues
1. Topology does not launch
Symptom - Heron client fails to launch the topology.
Note that the heron client executes the topology's main method on the local system, which means spouts and bolts are instantiated locally, serialized, and then sent over to the schedulers as part of topology.defn. It is important to make sure that:
- All spouts and bolts are serializable.
- Non-serializable attributes are not instantiated in a constructor. Leave those to a bolt's prepare or a spout's open method, which is called when the instance starts up (see the sketch after this list).
- The main method does not try to access anything that your local machine may not have access to.
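As a sketch of the second point, here is a bolt that keeps only serializable configuration in its constructor and defers creating a non-serializable client to prepare. DatabaseClient is a stand-in for any non-serializable resource (connections, thread pools, file handles); package names assume a recent Apache Heron release.

```java
// Hedged sketch of the constructor-vs-prepare distinction.
import java.util.Map;

import org.apache.heron.api.bolt.BaseRichBolt;
import org.apache.heron.api.bolt.OutputCollector;
import org.apache.heron.api.topology.OutputFieldsDeclarer;
import org.apache.heron.api.topology.TopologyContext;
import org.apache.heron.api.tuple.Tuple;

public class WriterBolt extends BaseRichBolt {
  private static final long serialVersionUID = 1L;

  private final String connectionString;      // serializable configuration is fine here
  private transient DatabaseClient client;    // non-serializable, so NOT created in the constructor
  private transient OutputCollector collector;

  public WriterBolt(String connectionString) {
    // Only keep serializable fields; the bolt object is serialized into topology.defn.
    this.connectionString = connectionString;
  }

  @Override
  public void prepare(Map<String, Object> conf, TopologyContext context,
                      OutputCollector outputCollector) {
    // Called on the instance at start time, after deserialization: safe to open connections here.
    this.collector = outputCollector;
    this.client = new DatabaseClient(connectionString);
  }

  @Override
  public void execute(Tuple tuple) {
    client.write(tuple.getString(0));
    collector.ack(tuple);
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Terminal bolt: no output stream declared.
  }

  // Placeholder client so the sketch compiles on its own.
  private static final class DatabaseClient {
    DatabaseClient(String connectionString) { }
    void write(String value) { }
  }
}
```

The same pattern applies to spouts, with open playing the role of prepare.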
2. Topology does not start
We assume here that the heron client has successfully launched the topology.
Symptom - The physical plan or logical plan does not show up in the UI.
Possible Cause - One or more stream managers have not yet connected to the Tmaster.
What to do -
- Go to the Tmaster logs for the topology. The zeroth container is reserved for the Tmaster. Go to that container and browse to log-files/heron-tmaster-.INFO to see which stream managers have not yet connected. The stmgr ID corresponds to the container number; for example, stmgr-10 corresponds to container 10, and so on.
- Visit that container to see what is wrong in that stream manager's logs, which can be found in its log-files directory, just as for the Tmaster.
3. Instances are not starting up
A topology will not start until all of its instances are running, so an instance that fails to come up can be the reason a topology does not start.
Symptom - The stream manager logs for that instance never show the instance connecting to it.
Possible Cause - Bad configs were passed when the instance process was launched.
What to do -
- Visit the container and browse to the heron-executor.stdout and heron-executor.stderr files. All commands used to instantiate the instances and stream managers are redirected to these files.
- Check the JVM configs for anything amiss. If Xmx is too low, increase containerRAM or componentRAM (see the sketch below). Note that because Heron sets aside some RAM for its internal components, such as the stream manager and metrics manager, having a large number of instances and a low containerRAM may starve those instances.
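For example, the following hedged sketch raises the per-container and per-component RAM. It assumes the setContainerRamRequested and setComponentRam helpers available in recent Apache Heron releases; the component names are illustrative.

```java
// Hedged sketch: giving containers and individual components more RAM so the JVM
// Xmx computed for each instance is not too low.
import org.apache.heron.api.Config;
import org.apache.heron.common.basics.ByteAmount;

public final class RamTuningExample {
  public static Config buildConfig() {
    Config conf = new Config();

    // Total RAM requested per container; Heron reserves part of this for the
    // stream manager and metrics manager, and the rest is divided among instances.
    conf.setContainerRamRequested(ByteAmount.fromGigabytes(6));

    // RAM for each instance of a specific component (names come from the topology).
    conf.setComponentRam("source", ByteAmount.fromGigabytes(1));
    conf.setComponentRam("writer-bolt", ByteAmount.fromGigabytes(2));

    return conf;
  }
}
```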
4. Metrics for a component are missing/absent
Symptom - The upstream component is emitting data, but this component is not executing any tuples, and no metrics are being reported.
Possible Cause - The component might be stuck in a deadlock. Since an instance is a single JVM process and user code is called from the main thread, it is possible that execution is stuck in the execute method.
What to do -
- Check the logs for one of the affected instances. If the open (in a spout) or prepare (in a bolt) method has not completed, check the code logic to see why it does not complete.
- Check the code logic for deadlocks in a bolt's execute or a spout's nextTuple, ack, or fail methods. These methods should be non-blocking (see the sketch after this list).
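The following is a hedged sketch of one way to keep execute non-blocking: every external call is given an explicit time budget, and the tuple is failed for replay instead of the method waiting indefinitely. ExternalClient and its tryWrite method are illustrative stand-ins; package names assume a recent Apache Heron release.

```java
// Hedged sketch of keeping a bolt's execute() from blocking indefinitely.
import java.util.Map;

import org.apache.heron.api.bolt.BaseRichBolt;
import org.apache.heron.api.bolt.OutputCollector;
import org.apache.heron.api.topology.OutputFieldsDeclarer;
import org.apache.heron.api.topology.TopologyContext;
import org.apache.heron.api.tuple.Tuple;

public class BoundedWaitBolt extends BaseRichBolt {
  private static final long serialVersionUID = 1L;

  private transient OutputCollector collector;
  private transient ExternalClient client;

  @Override
  public void prepare(Map<String, Object> conf, TopologyContext context,
                      OutputCollector outputCollector) {
    this.collector = outputCollector;
    this.client = new ExternalClient();
  }

  @Override
  public void execute(Tuple tuple) {
    // Bounded wait: never call anything here that can block forever
    // (unbounded queue.take(), lock.wait(), retries without a limit, ...).
    boolean ok = client.tryWrite(tuple.getString(0), 100);   // 100 ms budget
    if (ok) {
      collector.ack(tuple);
    } else {
      collector.fail(tuple);   // let the spout replay it later instead of hanging here
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // No downstream stream declared in this sketch.
  }

  // Placeholder for a client whose calls accept a timeout; only here so the sketch compiles.
  private static final class ExternalClient {
    boolean tryWrite(String value, long timeoutMillis) { return true; }
  }
}
```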
5. There is backpressure from internal bolt
A bolt is called internal if it does not talk to any external service. For example, a final bolt that writes its results to some database talks to an external service and would therefore not be called an internal bolt.
Backpressure from an internal bolt is invariably due to a lack of resources given to that bolt. Increasing its parallelism or RAM (based on the code logic) can solve the issue.
6. There is backpressure from external bolt
By the same definition as above, an external bolt is one that accesses an external service. It might still be emitting data downstream.
Possible Cause 1 - External service is slowing down this bolt.
What to do -
Check whether the external service is the bottleneck, and see whether adding resources to it solves the problem.
Sometimes, changing the bolt logic to trade caching against write rate can make a difference (see the sketch below).
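As an illustration, the following hedged sketch buffers tuples and writes them to the external service in batches, trading a little latency for far fewer round trips. BATCH_SIZE and ExternalStore are illustrative, and package names assume a recent Apache Heron release.

```java
// Hedged sketch of batching writes to an external service and acking per batch.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.heron.api.bolt.BaseRichBolt;
import org.apache.heron.api.bolt.OutputCollector;
import org.apache.heron.api.topology.OutputFieldsDeclarer;
import org.apache.heron.api.topology.TopologyContext;
import org.apache.heron.api.tuple.Tuple;

public class BatchingWriterBolt extends BaseRichBolt {
  private static final long serialVersionUID = 1L;
  private static final int BATCH_SIZE = 100;   // illustrative; tune against latency needs

  private transient OutputCollector collector;
  private transient ExternalStore store;
  private transient List<Tuple> buffer;

  @Override
  public void prepare(Map<String, Object> conf, TopologyContext context,
                      OutputCollector outputCollector) {
    this.collector = outputCollector;
    this.store = new ExternalStore();
    this.buffer = new ArrayList<>(BATCH_SIZE);
  }

  @Override
  public void execute(Tuple tuple) {
    buffer.add(tuple);
    if (buffer.size() >= BATCH_SIZE) {
      List<String> rows = new ArrayList<>(buffer.size());
      for (Tuple t : buffer) {
        rows.add(t.getString(0));
      }
      store.writeBatch(rows);               // one round trip instead of BATCH_SIZE
      for (Tuple t : buffer) {
        collector.ack(t);                   // ack only after the batch is written
      }
      buffer.clear();
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Terminal bolt; nothing emitted downstream in this sketch.
  }

  // Placeholder external store; only here to keep the sketch self-contained.
  private static final class ExternalStore {
    void writeBatch(List<String> rows) { }
  }
}
```

A real implementation would also flush the buffer on a timer, so that buffered tuples are not hit by the tuple timeout during quiet periods.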
Possible Cause 2 - A resource crunch for this bolt, just like the internal bolt above.
What to do -
- This should be handled in the same way as an internal bolt: by increasing the parallelism or RAM for the component.