Jaegertracing in Ceph 101 and my struggles till now.

Hey there! o/

I am interning with Ceph for Outreachy Summer, 2019. You can read about my experience applying here.

To begin with, I must share a realisation: developers have just two states.

[Image: "two states of a developer" meme. Credits: Google Images]

Working with a large codebase like Ceph is overwhelming. My first conversation with my mentor was about how I should go about studying it.

He gave me an outline of where to begin and how to approach it. Beyond that, I learned the most from discussing things with my mentors and seeing how they look at problems, one example being how they caught an error I was hitting while setting up my environment.

According to my mentor Josh Durgin,

An engineer must know how to read a codebase, it is so important that you can reach the call stack and debug at much deeper level.

As we enter the third week, I am going to share how it has all been so far, along with my experience of reading the Ceph source code and troubleshooting issues.

I’ll get to that after a brief outline of the project, so that you have the context.

What am I working on?

TL;DR: I’ll be working with Ceph on adding Jaeger tracepoints, which will make Ceph more robust, easier to trace and hence easier to debug.

What is Ceph?

Ceph is a large-scale distributed storage system that intelligently handles petabyte-scale data. It is based on RADOS (Reliable Autonomic Distributed Object Store). Ceph is a huge codebase, built by passionate people working on fast-paced development, with daily report meetings and discussions, community events, hackathons and an annual meet, Cephalocon. Read more about Ceph here.

What is distributed tracing?  

As existing systems grow more complex, so does the need to debug them, and mere print statements and tests no longer suffice. Such systems need discrete, distributed tracing. Tracing them the way you would trace a monolithic service leaves things invisible: if even one system fails, none of the further operations can be traced, which defeats the whole purpose of tracing. So the cloud community came to need a tracing system that is easy to integrate, discrete, standard and capable of generating insights, but without standardization no system could provide all of the functionality that was needed.
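To make the idea a bit more concrete, here is a tiny sketch of what a distributed trace boils down to: every unit of work is a span, all spans of one request share a trace ID, and each span remembers its parent. This is a toy written for this post, not Jaeger's API and not Ceph code. The names ToyTracer, handle_write and render_tree, as well as the service and operation labels, are made up purely for illustration.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One timed unit of work inside a trace."""
    trace_id: str              # shared by every span of the same request
    span_id: str               # unique to this span
    parent_id: Optional[str]   # None for the root span
    service: str
    operation: str
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None


class ToyTracer:
    """Keeps finished spans in memory; a real tracer would ship them to a backend."""

    def __init__(self) -> None:
        self.finished: List[Span] = []

    @contextmanager
    def span(self, service: str, operation: str, parent: Optional[Span] = None):
        # A child inherits its parent's trace ID; a root span starts a new trace.
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        parent_id = parent.span_id if parent else None
        current = Span(trace_id, uuid.uuid4().hex[:16], parent_id, service, operation)
        try:
            yield current
        finally:
            # Record the span even if the traced work raised an exception.
            current.end = time.monotonic()
            self.finished.append(current)


def handle_write(tracer: ToyTracer) -> Span:
    """Hypothetical request that hops across three services."""
    with tracer.span("client", "write_object") as root:
        with tracer.span("osd.0", "primary_write", parent=root) as primary:
            with tracer.span("osd.1", "replica_write", parent=primary):
                pass  # the actual work would happen here
    return root


def render_tree(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Rebuild the call tree from parent links, the way a tracing UI would."""
    lines = []
    children = sorted((s for s in spans if s.parent_id == parent_id), key=lambda s: s.start)
    for s in children:
        lines.append("  " * depth + f"{s.service}: {s.operation}")
        lines.extend(render_tree(spans, s.span_id, depth + 1))
    return lines


if __name__ == "__main__":
    tracer = ToyTracer()
    handle_write(tracer)
    print("\n".join(render_tree(tracer.finished)))
```

The point of the sketch is that the spans are recorded independently by each service, and the tree is stitched back together afterwards from the IDs alone. That is why one request can still be followed across many processes.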

You can learn more about tracing by reading about Dapper, LightStep and Zipkin.

Why did Ceph choose to use distributed tracing?

With the arrival of well-maintained standard libraries with active contributors, like Jaeger, distributed tracing, a technology that has existed for a decade but could not be used for lack of standardization, can now be integrated easily. Ceph has a lot of background processes, and the existing setup pieced together from LTTng, Blkin, Babeltrace and Zipkin does not do justice to them; it also has a learning curve and yields few insights. In contrast, Jaeger presents traces in a nice UI, which can then be used to gather insights. This gives Ceph a path to a much improved tracing system.

If this can be achieved, it could transform the way Ceph is debugged, making “Ceph faster in the face of failure”.

That was the basic overview of the project. Now, on to the problems I have faced so far:

Setting up the development environment

The most troublesome step in the whole process was building the project. I had built Ceph before applying, but that build did not include the Jaeger libraries that Ceph needs for adding instrumentation, so the first thing was to figure that out. I cleaned up that code and installed the libraries locally, yet some issues still came up. I tried fixing them in the dependencies' code, but had to resort to commenting out the offending lines in the source and rebuilding, since they were just some checks. That did the trick.

Troubleshooting vstart

The most significant problem I came across was caused by some conflicting code in Ceph: even after building successfully, my build would not start. My mentor then showed me how to track that error down. I not only solved the issue but also learned the whole process of debugging Ceph, which I hope to cover extensively in coming posts.

Communication and working in different time zones

My mentors are in the EST time zone and I am in IST. Initially I had trouble adjusting to EST, but I now prefer working those hours, because whenever I have an issue I do not have to wait a day to discuss it. I leave a message on Slack, and if I am still not able to solve it, I ping my mentor. This way they are also aware of what I am doing.

Asking for help and why Ceph is awesome!

Working remotely teaches you the value of communication: when you get stuck and just want to bang your head, you reach out. The great thing about Ceph is its community. They are a bunch of compassionate people who are there for you, with communication being the backbone of their growth. I participate in daily meetings where we give updates on our progress and discuss issues, if any. Along with this I have really supportive mentors, Josh, Neha and Ali, with whom I clear my doubts over audio/video, IRC, chat and mail.

My experience has been awesome so far and I hope it stays that way. In coming posts I’ll write about where to look when troubleshooting Ceph, and more about my experiences.
