better CPU metric support #18

dpocock · 2014-04-24T09:36:23Z

JMX provides the ProcessCpuTime metric. It is the number of nanoseconds of CPU time used by the JVM.

Graphing the raw value in Ganglia is unhelpful.

gmond itself sends cpu_* metrics (e.g. cpu_system) and these are percentages derived from the raw tick values of the host. jmxetric could derive percentages from the JVM values and send those instead. This would be more consistent with the Ganglia design and reduce network load when values are not changing.

We should also consider how this should work for multi-threaded Java apps on multi-CPU systems.

Other projects like jmxtrans have examples of working with these values.

dpocock · 2014-07-01T14:02:22Z

@ngzhian Zhi An, could you please have an initial look at this issue and let me know your observations about it?

ngzhian · 2014-07-04T13:10:51Z

@dpocock by percentage do you mean CPU time used by JVM over total CPU time?

dpocock · 2014-07-04T13:12:49Z

Actually, we need to work out the correct definition for this.

Have you seen the CPU graphs in JConsole? I believe it takes the same raw values from JMX. You could try making some simple code that starts 2 threads, each using 100% of a CPU core and see how that looks in JConsole for a CPU with 4 or more cores.
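A minimal sketch of the kind of test program suggested above, assuming Java 8 or later. The class name `BusyThreads` and its methods are made up for illustration; nothing here comes from jmxetric. It starts a given number of threads that each spin for a fixed duration, so the JVM can be watched in JConsole while they run.

```java
// Hypothetical test program: keeps N cores busy for a fixed time.
final class BusyThreads {

    // Spins until the deadline so the calling thread keeps one core busy.
    static void burn(long millis) {
        long end = System.nanoTime() + millis * 1_000_000L;
        while (System.nanoTime() < end) {
            // busy wait on purpose
        }
    }

    // Starts `count` spinning threads, waits for them, and returns how many finished.
    static int runAndWait(int count, long millis) {
        Thread[] threads = new Thread[count];
        for (int i = 0; i < count; i++) {
            threads[i] = new Thread(() -> burn(millis), "burner-" + i);
            threads[i].start();
        }
        int finished = 0;
        for (Thread t : threads) {
            try {
                t.join();
                finished++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return finished;
    }

    public static void main(String[] args) {
        // Default: 2 threads for 60 seconds, matching the scenario in this thread.
        long millis = args.length > 0 ? Long.parseLong(args[0]) : 60_000L;
        System.out.println("finished threads: " + runAndWait(2, millis));
    }
}
```

Running it with the default arguments gives about a minute to attach JConsole and look at the CPU graph.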

ngzhian · 2014-07-06T06:29:28Z

Going back to gmond monitor-core cpu_user_func, we observe that the value is calculated as follows:

diff = user_jiffies - last_user_jiffies;
val.f = ((double)diff/(double)(total_jiffies - last_total_jiffies)) * 100.0;

which translates to the portion of total CPU time spent on user activity over a particular time period.

So a simple way is to take (ProcessCpuTime / Uptime * 100), where ProcessCpuTime is the "Total amount of CPU time that the Java VM has consumed since it was started" and Uptime is the "total amount of time since the JVM was started". This will result in a percentage value in the range [0, 100] that will fluctuate as the JVM is running. We could also store a lastProcessCpuTime and lastUptime and take the differences; in that case the calculation would be similar to the one in monitor-core.

I wrote a simple program that loops and uses CPU cycles so that I could verify the numbers in jconsole, and took some screenshots of the values there. These will be posted in a blog post soon.

Also I noted some points, such as:

Uptime is given in milliseconds
ProcessCpuTime in nanoseconds
we can't distinguish between system, user CPU time, unlike the monitor-core implementation.
we can only poll all running threads for their ThreadCpuTime and ThreadUserTime and do our own calculation if we wish to provide the different CPU time.
at certain times, ProcessCpuTime is larger than Uptime, I will need to investigate this further
each Thread can have its own cpu usage as well
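The delta-based calculation and the unit notes above can be sketched roughly as follows. This is only an illustration: the class `JvmCpuSampler` is hypothetical, and it assumes a HotSpot/OpenJDK JVM where the `java.lang:type=OperatingSystem` MBean exposes `ProcessCpuTime`. Note that the result is not divided by the number of cores, so it can exceed 100 when more than one thread is busy, which may be related to ProcessCpuTime sometimes being larger than Uptime.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper, not part of jmxetric: turns the cumulative JMX
// counters into a percentage over the interval between two samples.
final class JvmCpuSampler {
    private long lastCpuNanos = -1;
    private long lastUptimeMillis = -1;

    // cpuNanos: ProcessCpuTime (nanoseconds); uptimeMillis: Uptime (milliseconds).
    // Returns NaN for the first sample or when no wall-clock time has elapsed.
    double sample(long cpuNanos, long uptimeMillis) {
        double result = Double.NaN;
        if (lastUptimeMillis >= 0 && uptimeMillis > lastUptimeMillis) {
            double cpuDeltaNanos = cpuNanos - lastCpuNanos;
            // Convert the uptime delta from milliseconds to nanoseconds.
            double wallDeltaNanos = (uptimeMillis - lastUptimeMillis) * 1_000_000.0;
            result = cpuDeltaNanos / wallDeltaNanos * 100.0;
        }
        lastCpuNanos = cpuNanos;
        lastUptimeMillis = uptimeMillis;
        return result;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        ObjectName os = new ObjectName(ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME);
        ObjectName rt = new ObjectName(ManagementFactory.RUNTIME_MXBEAN_NAME);
        JvmCpuSampler sampler = new JvmCpuSampler();
        for (int i = 0; i < 3; i++) {
            // Assumption: ProcessCpuTime is available on this JVM and is a Long.
            long cpu = (Long) mbs.getAttribute(os, "ProcessCpuTime");
            long up = (Long) mbs.getAttribute(rt, "Uptime");
            System.out.println(sampler.sample(cpu, up));
            Thread.sleep(1000);
        }
    }
}
```

Keeping the previous sample inside the helper mirrors the `last_user_jiffies` / `last_total_jiffies` approach in monitor-core.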

ngzhian · 2014-07-11T06:28:07Z

Updated everything in this blog post.
As of now every JMX metric value is retrieved using this code

Object o = mbs.getAttribute(objectName, attributeName);

This makes it difficult to modify the raw value of an MBean attribute before sending it, as we need to do here to send percentage values.
I'm still thinking of a good way to incorporate this requirement into the current system.
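One possible direction, sketched here purely as an assumption and not as how jmxetric is or will be structured, is to pass an optional transform alongside the attribute lookup. The class `TransformingReader` and its `read` method are hypothetical names; only `MBeanServer.getAttribute` is the real call quoted above.

```java
import java.lang.management.ManagementFactory;
import java.util.function.UnaryOperator;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical wrapper: reads an attribute and optionally post-processes it.
final class TransformingReader {
    private final MBeanServer mbs;

    TransformingReader(MBeanServer mbs) {
        this.mbs = mbs;
    }

    // A null transform returns the raw attribute value unchanged.
    Object read(ObjectName objectName, String attributeName, UnaryOperator<Object> transform) {
        try {
            Object raw = mbs.getAttribute(objectName, attributeName);
            return transform == null ? raw : transform.apply(raw);
        } catch (JMException e) {
            throw new IllegalStateException("cannot read " + attributeName, e);
        }
    }

    public static void main(String[] args) {
        TransformingReader reader =
                new TransformingReader(ManagementFactory.getPlatformMBeanServer());
        ObjectName runtime = ManagementFactory.getRuntimeMXBean().getObjectName();
        // Example transform: Uptime in milliseconds converted to seconds.
        Object seconds = reader.read(runtime, "Uptime", v -> ((Long) v) / 1000.0);
        System.out.println("uptime seconds: " + seconds);
    }
}
```

A stateful transform (one that remembers the previous sample) would be needed for the percentage case, so a plain function like this is only a starting point.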

dpocock · 2014-08-06T08:15:33Z

Can you just clarify one other thing about multi-CPU: let's say that you have the following:

server has 4 CPU cores
JVM runs for 60 seconds
2 threads are running, each using 100% of a CPU core for the whole 60 seconds

What would the actual calculations and results be in this case?

E.g. Uptime = 60 seconds
available CPU time = 4*60 seconds (because we have 4 cores)?
ProcessCpuTime = 2*60 seconds (because we have 2 threads)?
result = ProcessCpuTime / 60 = 200%?
or ProcessCpuTime / (4*60) = 50%?

What does JConsole show in this case?

You don't need to worry about the other things from mod_multicpu, e.g. the system or user jiffies. Just getting Ganglia to show something comparable to the JConsole CPU graph is sufficient.

ngzhian · 2014-08-16T07:04:01Z

The way jconsole does it is:
ProcessCpuTime / Uptime / #processors,
hence it's 120 / 60 / 4 = 0.5, i.e. 50%.
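As a small worked check of that formula, here is a sketch with the numbers from the 4-core example. The class `NormalizedCpu` and its method are hypothetical names used only for illustration, and the inputs are assumed to be already converted to seconds.

```java
// Hypothetical helper: ProcessCpuTime / Uptime / #processors, as a percentage.
final class NormalizedCpu {

    // Both time arguments must be in the same unit (seconds here).
    static double percent(double cpuSeconds, double uptimeSeconds, int processors) {
        return cpuSeconds / uptimeSeconds / processors * 100.0;
    }

    public static void main(String[] args) {
        // 2 busy threads for 60 s on a 4-core machine: 120 / 60 / 4 = 0.5
        System.out.println(percent(120, 60, 4)); // prints 50.0
        // Without dividing by the core count the same data gives 200%.
        System.out.println(percent(120, 60, 1)); // prints 200.0
        // On a live JVM the core count could come from the runtime.
        System.out.println("cores here: " + Runtime.getRuntime().availableProcessors());
    }
}
```

Dividing by the number of processors keeps the value within 0 to 100, which is what makes it comparable to the JConsole graph.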
