Agile on data analytics
Posted by paul, Jun 13
I have been on a Data Management team for almost 9 months now. I am part of the technical team that completes many of the requests initiated by the business. When I first started, the choice of tools for producing reports was fairly standard: upstream teams populate a number of staging tables in an Oracle database, in a schema belonging to our team.
We, the technical data team, use either Toad or sqlplus to extract data from the tables and produce the necessary reports based on requirements. Sounds pretty straightforward, right? Well, sorta.
Now some other context for the team. The infrastructure we use is shared amongst many other teams, whose activities often take priority. We often notice instability in the Oracle database, which in turn causes frustration and delays in our work. Unfortunately this is something we have had to live with for the time being, and I believe this is often the case for many projects out there.
Around early December last year, it was decided to give my team DataStage access. This was intended to help alleviate some of the strain put on the database, and hopefully to migrate some of the intensive processing to a Sun Grid environment where DataStage jobs are executed.
This brings me to the Agile part of this post. As a technical team we were faced with a number of tasks while at the same time maintaining standard operational duties (i.e. keep those reports coming). DataStage as a tool and an environment was newish to the team overall, but not completely unfamiliar.
So the team was under some pressure, as you can understand. We were given a new tool to learn and use, we needed to convert some of the existing SQL reports to DataStage, and at the same time restructure things (at least the data part) so that we could make better use of the infrastructure and put less load on it.
At the time, I thought some of the Agile methods I had learnt might be relevant in this instance.
So this is what we did:
1. All reports were analyzed, with the aim of streamlining the overall process. Due to the time constraints, we decided to ‘pilot’ some of the most urgent and pressing reports. This is similar to identifying the ‘must-have’ features for a piece of software.
2. Reports were assigned to individual developers with the aim of converting them to DataStage jobs. At the start of every day we did a quick stand-up (~15-20 minutes) to track progress. I had all the development tasks on a whiteboard, and I encouraged developers to update them there instead of sending emails. This actually helped the conversion process a lot, as we started to discover synergies amongst the reports, which helped with 1. Each iteration was about 2 weeks long, with concrete deliverables.
3. We failed early. A couple of things that we did in SQL were deemed technically challenging to do in DataStage, so we decided to have hybrid jobs where much of the heavy lifting of data processing is done in DataStage and the remainder is done in SQL.
4. We adopted pair programming. When we first started developing DataStage jobs, we generally had 2 people work on a job at the start. Once it had progressed to a degree, the team members might be swapped to foster knowledge sharing. This seemed to happen naturally for us, as we already had a peer-review process in place.
Now the outcome. Over the past 3 months I have found things go a lot smoother for us. We still had the occasional database downtime, but we were able to mitigate much of the risk around that. There have also been times when the DataStage grid environment was heavily used, so we had delays there too. But the redundancy of having data available in both the database and the Grid environment meant that we had more flexibility. This, though, is more a by-product of what has changed in the team than a direct result of being Agile.
Some of the real benefits of Agile for our team are:
1. We seem to have a far better understanding and visibility of what is available to us and what we could provide. Silos of knowledge in individual members are now shared much more readily and willingly.
2. We take far greater ownership of the work we have, compared to the ‘spoke-model’ where work is directed from the team lead. We seem to have a more parallel-streams approach to work.
3. Mistakes and errors are detected early and resolved early. This, combined with 2., meant we achieved higher quality in the reports that we deliver, and in a shorter timeframe.
Development of DataStage jobs follows a ‘promotion’ model, where an individual developer’s sandbox is used for much of the development activity, and when a job is peer-reviewed and deemed good enough for prime time, it is then ‘promoted’ to the released folder. We also have a ‘deprecated’ folder to renew some of the jobs.
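The promotion model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the team's DataStage project is actually organised: it treats each job as an exported file on disk, and the `promote` function, the folder layout, and the rule that a previously released version gets parked in the ‘deprecated’ folder are all hypothetical.

```python
from pathlib import Path
import shutil


def promote(job: Path, released: Path, deprecated: Path, peer_reviewed: bool) -> Path:
    """Copy a peer-reviewed job export from a sandbox into the released folder."""
    if not peer_reviewed:
        # Nothing leaves a developer's sandbox without a peer review.
        raise ValueError(f"{job.name} has not been peer-reviewed")

    released.mkdir(parents=True, exist_ok=True)
    deprecated.mkdir(parents=True, exist_ok=True)

    target = released / job.name
    if target.exists():
        # Assumption: the version being superseded is kept in 'deprecated'
        # (overwriting any older copy there) rather than being deleted.
        target.replace(deprecated / job.name)

    shutil.copy2(job, target)  # the sandbox copy stays in place
    return target
```

Calling `promote(Path("sandbox/paul/monthly_report.dsx"), Path("released"), Path("deprecated"), peer_reviewed=True)` would copy the (hypothetical) export into `released/`, moving any earlier released copy of the same job into `deprecated/` first.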
The next activity I’d like to focus on is further aligning the data we have with a well-designed reporting schema that suits our needs, to further reduce the load put on some of the common data we use.
As always your comments are welcome, and I would certainly love to know what you did or tried when your technical team faced something similar to ours.
~paul
