I find that many data teams operate from a tool-based frame of reference. This can include big data platforms, relational databases, visualization or reporting platforms, or statistical tools.
I tend to focus more on concepts and techniques. When working with clients, I wait before recommending particular tools. A good choice depends on the particulars of each situation, including the available skills and the analysis goals in question. That said, many different data tools are excellent, and will do a fine job in the right hands.
However, over the years I have also seen teams get burned by tool choices that just weren’t right for them. I’ve also worked as a developer for data-tool vendors, and seen that “side of the street.” So I feel obliged to pitch in with a few “do’s and don’ts” for teams that are looking for new data tools.
- Do have someone advocate your position, and don’t rely purely on vendor information. Vendors are advocates – they want to sell you something. Use a consultant – internal or external – to help you navigate the process and ensure your requirements are met. It’s not your job to be an expert on every tool’s feature set.
- Do know what you want to accomplish, and have a list of core requirements (five to ten is a good number). Use those requirements as the basis for a purchase. A friend of mine worked with a team that needed optimization software, and then spent a great deal of money on a commercial statistics suite devoid of optimization features. That really hurts…
- Don’t assume that buying a tool means an actual problem has been solved, even after a “demonstration” of the tool’s capabilities with your data. I’ve seen what is inside those demonstrations (and participated in some) – there is definite value, but take it with a grain of salt, too. Related: don’t ask a vendor to solve a real problem for you as an unpaid contingency of sale. It will guarantee nothing – they are not invested in your outcome, and don’t know your data. Many teams will spend at least 10 times the tool outlay to develop their production application.
- Don’t simply assume open-source software has a lower total cost of ownership. Particularly for infrastructure tools, the total cost of open source is often higher, and so is the risk: support information can be conflicting, and the human expertise required is expensive. On the other hand, open source is sometimes the best route to the most innovative technologies, and some open-source suites (including R and some Apache tools) have large and stable user communities.
- Do prove out your scaling as soon as possible, ideally while you are still in the evaluation period. Load any test data you like, but demonstrate scaling before you buy or upgrade. That requires some planning, but it’s also a place where teams get burned.
- Do consider security and integration as part of your requirements. I’ve personally worked on retrofitting security and integration – both are expensive.
- Do prioritize rock-solid performance of basic features and system integration (have a list of what you want) over exotic vendor features. I’ve worked for vendors – esoteric features, in particular, often don’t perform as advertised.
- Do consider using R first if you are entering the realm of quantitative analysis or statistics work. Particularly in quantitative work, I’ve found that concepts and techniques are often more important than tools. So why not start with a tool that is free, has a gigantic user community, and frequently makes new techniques available before anyone else? Once you are comfortable you might change, but you’ll then have the knowledge to make a better decision. (By the way, this is a change for me. Open-source tools can be a problem for people jumping in, but the R community and tool are now mature enough that I feel it’s a solid recommendation.)
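The “prove out your scaling” advice above can be reduced to a simple timing probe: run the same workload at growing data sizes during your evaluation and watch how the cost grows. The sketch below is hypothetical – `probe_scaling` and `sort_workload` are names I made up, and the in-memory sort is just a stand-in for whatever query or load you want the candidate tool to handle.

```python
import random
import time


def probe_scaling(workload, sizes):
    """Time `workload(n)` at each size in `sizes` and return (n, seconds) pairs.

    `workload` is any callable taking a size n that exercises the
    operation under test against the candidate tool.
    """
    timings = []
    for n in sizes:
        start = time.perf_counter()
        workload(n)
        timings.append((n, time.perf_counter() - start))
    return timings


def sort_workload(n):
    # Stand-in workload: generate n synthetic records and sort them.
    # Replace this with a real load or query against the tool you are evaluating.
    data = [random.random() for _ in range(n)]
    data.sort()


results = probe_scaling(sort_workload, [10_000, 100_000, 1_000_000])
for n, secs in results:
    print(f"{n:>9} rows: {secs:.3f}s")
```

If a 10x jump in rows costs much more than 10x in time (or blows past memory), you have found the burn point before signing, not after.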