The Holy Java

Notes of a passionate Java EE developer

Posts Tagged ‘bigdata’

Most interesting links of April ’13

Posted by Jakub Holý on April 30, 2013

Recommended Readings

The top top article

How To Survive a Ground-Up Rewrite Without Losing Your Sanity (recommended by Kent Beck) – sometimes you need to actually rewrite an important part of a system; here we learn about two such rewrites, one which went well and one that failed badly – and what the important differences are.

The pain of a rewrite: “it’s [a major rewrite] going to take insanely longer than you expect” – because: “there’s this endless series of weird crap encoded in the data in surprising ways” and it takes days to convert them, “It’s brutally hard to reduce scope” (you cannot drop features, edge cases), “There turn out to be these other systems that use ‘your’ data”.

To succeed you need to: 1) Determine clear business-visible wins, both to justify an effort that will be much higher than expected and to know when to give up or what to postpone; 2) Do it extremely incrementally (<-> Succession) – break it into a series of small, safe steps, each generating business value and learning of its own and thus enabling early and frequent economic tradeoffs (stop, shift priorities, …) – ex.: rewrite a single report, migrate its data, switch customers to it, go on to the next one – a complete slice of functionality => a more realistic estimate soon => reprioritisation; incrementalism requires you to be able to write data to both the old and the new system, which is hard but always pays off: “Here’s what I’m going to say: always insert that dual-write layer. Always. It’s a minor, generally somewhat fixed cost that buys you an incredible amount of insurance.” 3) Keep in mind that “Abandoning the Project Should Always Be on the Table” (<- known biz value, better estimate based on early feedback).
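To make the “dual-write layer” from point 2 a bit more tangible, here is a minimal Java sketch of the idea. It is only my illustration, not code from the article: the ReportStore interface, the in-memory stores and the failure counter are all hypothetical names. Every write goes to the old system (still the source of truth) and is then mirrored to the new one; a failure of the new system is recorded but never breaks the request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class DualWriteSketch {

    /** Hypothetical storage abstraction implemented by both the legacy and the rewritten system. */
    interface ReportStore {
        void save(String id, String payload);
        Optional<String> find(String id);
    }

    /** Trivial in-memory stand-in used for both systems in this sketch. */
    static final class InMemoryStore implements ReportStore {
        private final Map<String, String> data = new HashMap<>();

        @Override public void save(String id, String payload) { data.put(id, payload); }
        @Override public Optional<String> find(String id) { return Optional.ofNullable(data.get(id)); }
    }

    /** Writes go to both systems; reads stay on the old one until the slice is switched over. */
    static final class DualWriteStore implements ReportStore {
        private final ReportStore oldStore;
        private final ReportStore newStore;
        private int newStoreFailures = 0;

        DualWriteStore(ReportStore oldStore, ReportStore newStore) {
            this.oldStore = oldStore;
            this.newStore = newStore;
        }

        @Override public void save(String id, String payload) {
            oldStore.save(id, payload);          // the old system remains the source of truth
            try {
                newStore.save(id, payload);      // mirror the write into the new system
            } catch (RuntimeException e) {
                newStoreFailures++;              // record the problem, do not fail the request
            }
        }

        @Override public Optional<String> find(String id) { return oldStore.find(id); }

        int newStoreFailures() { return newStoreFailures; }
    }

    public static void main(String[] args) {
        InMemoryStore oldStore = new InMemoryStore();
        InMemoryStore newStore = new InMemoryStore();
        DualWriteStore store = new DualWriteStore(oldStore, newStore);

        store.save("report-1", "weekly totals");
        System.out.println("old: " + oldStore.find("report-1"));
        System.out.println("new: " + newStore.find("report-1"));
        System.out.println("failed mirror writes: " + store.newStoreFailures());
    }
}
```

Because both systems receive the same writes, you can compare their contents at any time and switch the reads of a single slice over (or back) without a big-bang migration.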

Some Specific Tactics: Shrink Ray FTW (a graph of how much has been already replaced => motivation), Engineer The Living Hell Out Of Your Migration Scripts (tests, robustness, error handling, restartability), If Your Data Doesn’t Look Weird, You’re Not Looking Hard Enough.

Methodology, agile, lean

  • M. Fowler: The New Methodology – a good description of the rise of Agile, the motivation for it, the various Agile methodologies (XP, Lean, Scrum etc.) and what is required to be able to apply an agile approach. Main points: Agile is adaptive (vs. predictive) and relies heavily on people and their judgement and skills (vs. treating them as identical, replaceable units) – which also leads to the need for leadership instead of (command&control) management. Discusses unpredictability of requirements and scope, foolishness of separating design and implementation, difficulty of measurement of SW development, continuous improvement etc. Quotes: “However letting go of predictability doesn’t mean you have to revert to uncontrollable chaos. Instead you need a process that can give you control over an unpredictability. That’s what adaptivity is all about.”
  • The Toyota concept of ‘respect for people’ – many state that they respect their workers but fail to really understand what it means; it is not about freedom to act, it is about mutual respect and leveraging each other’s strengths: the worker’s experience and insight and the manager’s broader overview, as demonstrated by the problem-solving dialog and challenges (problem – root cause – solution – measure of success, the manager challenging the worker’s answers). Also a nice example of how the evaluation of individual performance leads to a much worse system and high turnover compared to a whole-oriented company.
  • Fixed Bid Agile Without Cognitive Dissonance – a refreshing take on fixed-scope projects and Agile; yes, they are bad, but sometimes the client has no other choice, so what is the best we can make of it? The core advice: Agree “a pragmatic change management protocol (along with a contingency built into the pricing)” (push for lower initial requirements granularity, customer involvement, flexibility of functionality) => “you can gain significant agile benefits for clients who wouldn’t otherwise accept them”.
  • Agile Atlas: Scrum – a good description of Scrum and its values, roles, artifacts, and activities

Learning, psychology, estimates

  • How Developers Stop Learning: Rise of the Expert Beginner – sometimes you meet people with experience-indicating titles who are actually not very competent, perhaps leading incompetent IT departments. Why? Unchallenged by competent peers or the broader IT community, they came to believe that they are “experts” while actually being only slightly more advanced beginners, better than their beginner colleagues but still lacking any understanding of the big picture and the knowledge of what they do not know, trapped in the “unconscious incompetence” stage. The post explains this in more detail and is followed up by an explanation of how it can lead to the rise of a mediocre SW group in “How Software Groups Rot: Legacy of the Expert Beginner”.
  • Coding, Fast and Slow: Developers and the Psychology of Overconfidence (via @peterskeide) – why we are so bad at estimating (inherent complexity of SW vs. our overconfidence) and why it cannot be fixed. We can learn to estimate, to some degree, tasks a few hours long (less complex, plenty of practice opportunities). The question is: “how can your dev team generate a ton of value, even though you can not make meaningful long-term estimates?”
  • Cognitive Overhead, Or Why Your Product Isn’t As Simple As You Think (via @JiriJerabek) – to make apps more accessible to users, we try to make them simple – but “simple” might be different from what you expect. The important thing is not fewer steps, fewer features, or fewer elements, but lower cognitive overhead, i.e. “how many logical connections or jumps your brain has to make in order to understand or contextualize the thing you’re looking at.” Good examples of unexpectedly high / pleasantly low cognitive overhead and some tips, even surprising ones such as making people do more (to be more involved in the process – e.g. bump their phones) or slowing down your product.

Other

  • Economies of Scala – a case for using Scala over Java, supported by data: many capable developers want to use it but there are few opportunities for them – and getting developers is one of the main challenges.
  • A canonical Repository test – a nice standard way to test a “DAO”; highlights: use of FEST assert 2 for clean and nice checks, no unimportant details in the test (f.ex. details of the test data hidden in randomPerson() and randomOrder(Person)).
  • How To Think Like An Engineer – some nice ideas such as: “Build A Simple First Version: With People, Not Code” – “Technology is not always the best solution, because technology is not always the simplest solution.”, i.e. don’t automate everything from the start (examples from Netflix, Amazon); “Rather than trying to do everything at once, break down the functions of your company into smaller goals.” – and focus on one at a time
  • Economies Of Scale As A Service (not to be confused with Scala! :-) ) – an interesting description of the trend away from ownership to the rental of important resources (servers, manufacturing capabilities, personal cars, …) and the resulting changes in society, business, and industry
  • Troy Hunt: Our password hashing has no clothes (or the much shorter though biased How To Safely Store A Password) – MD5 and SHA are not safe enough due to brute-force attacks enabled by GPUs, irrespective of key size; it’s crucial to use hashing algorithms designed for passwords (and thus sufficiently slow) – f.ex. bcrypt, PBKDF2, or the newer scrypt.
  • Everything about Java 8 – a well-made summary of what should come in Java 8, based on the current state, discussing the finer points: static and default (non-static, overridable) methods on interfaces, lambdas (do I need to mention that?!) and method references (String::valueOf, Object::toString, myVar::toString, ArrayList::new); good discussion of the various use cases and limitations of lambdas (capturing vs. non-capturing, ..); java.util.stream for functional operations on value streams (filter, map, reduce etc.); java.time inspired by Joda, more concurrency utilities (e.g. CompletableFuture for chaining futures); String.join (finally!), Optional ~ Scala’s Option & more; yummy!
  • How To Keep Your Best Programmers – what motivates capable programmers to stay/leave? The author lists some common reasons and concludes that, ultimately, all are linked to the desire for autonomy, mastery, or purpose. However he goes further and proposes that, to keep talented devs, you must offer them an appealing narrative (regarding their actions and a result, related to autonomy/mastery/purpose) and reaffirm/update it frequently; ex.: “With the work that we’re giving you over the next few months, you’re going to become the foremost NoSQL expert in our organization.” “At any point, both you and the developers on your team should know their narratives.” – so that they will be “constant points of job satisfaction and purpose.”
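For a quick taste of the Java 8 features listed in the “Everything about Java 8” item above, here is a small sketch written against the APIs as they eventually shipped in Java 8. The Greeter interface and the helper methods are made up for illustration; only the JDK classes (Stream, Optional, CompletableFuture, String.join) are real.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class Java8Sketch {

    /** Hypothetical interface showing a default (overridable) and a static method. */
    interface Greeter {
        String name();

        default String greet() { return "Hello, " + name() + "!"; }

        static Greeter of(String name) { return () -> name; }  // lambda implements name()
    }

    /** java.util.stream with lambdas and a method reference (String::valueOf). */
    static List<String> evenSquares(List<Integer> numbers) {
        return numbers.stream()
                .filter(n -> n % 2 == 0)
                .map(n -> n * n)
                .map(String::valueOf)
                .collect(Collectors.toList());
    }

    /** Optional instead of null checks: length of the first word, or 0 if there is none. */
    static int firstLength(List<String> words) {
        Optional<String> first = words.stream().findFirst();
        return first.map(String::length).orElse(0);
    }

    /** CompletableFuture for chaining asynchronous steps; join() waits for the result. */
    static String shoutAsync(String text) {
        return CompletableFuture.supplyAsync(() -> text)
                .thenApply(String::toUpperCase)
                .thenApply(s -> s + "!")
                .join();
    }

    public static void main(String[] args) {
        System.out.println(Greeter.of("Java 8").greet());
        System.out.println(String.join(", ", evenSquares(Arrays.asList(1, 2, 3, 4))));
        System.out.println(firstLength(Arrays.asList("lambda", "stream")));
        System.out.println(shoutAsync("yummy"));
    }
}
```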

Clojure Corner

  • Clojure Data Analysis Cookbook review – “The book provides a collection of recipes for accomplishing common tasks associated with analyzing different types of data sets. It starts out by showing how to read data from a variety of sources such as JSON, CSV, and JDBC. [..] how to sanitize the collected data and sample large data sets. [..] a number of different strategies for processing it.” How to present them with ClojureScript and NVD3 (D3.js components). “Some of the highlights include using the Clojure STM, parallel processing of the data, including useful tricks for partitioning, using reducers, and distributed processing with Hadoop and Cascalog.”

Favorite Quotes

once again, trying to do it *and* do it right was too much all at once, resulting in little progress and little learning.

- Kent Beck’s tweet 2013-04-16

A true agile development process can be recognized by its continual evolution:

A project that begins using an adaptive process won’t have the same process a year later. Over time, the team will find what works for them, and alter the process to fit.

- Martin Fowler in The New Methodology

Posted in General, SW development, Testing, Top links of month

Most interesting links of January ’13

Posted by Jakub Holý on January 31, 2013

Recommended Readings

Various

  • Dustin Marx: Significant Software Development Developments of 2012 – Groovy 2.0 with static typing, rise of Git[Hub], NoSQL, mobile development (iOS etc.), Scala and Typesafe stack 2.0, big data, HTML5, security (Java issues etc.), cloud, DevOps.
  • 20 Kick-ass programming quotes – including Bill Gates’ “Measuring programming progress by lines of code is like measuring aircraft building progress by weight.”,  B.W. Kernighan’s “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”, Martin Golding’s “Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.” (my favorite)
  • How to Have a Year that Matters (via @gbrindusa) – do you want to just survive and collect possessions or do you want to make a difference? Some questions everybody should pose to him/herself.
  • Expression Language Injection – a security defect in applications using JSP EL that can sometimes lead to double evaluation of expressions and thus makes it possible to execute data supplied by the user (in request parameters etc.) as expressions; affects e.g. unpatched Spring 2.x and 3.

Languages etc.

  • HN discussion about Scala 2.10 – compilation speed and whether it matters, comparison of the speed and type system with Haskell and OCaml, problems with incremental compilation (dependency cycles, fragile base class), some speed up tips such as factoring out subprojects, the pros and cons of implicits etc.
  • Blog Mechanical Sympathy – interesting posts and performance tests regarding “writing software which works in harmony with the underlying hardware to gain great performance” such as Memory Access Patterns Are Important and Compact Off-Heap Structures/Tuples In Java.
  • Neal Ford: Functional thinking: Why functional programming is on the rise – Why you should care about functional programming, even if you don’t plan to change languages any time soon – N. Ford explains the advantages of FP and why FP concepts are spreading into other languages (higher abstractions enabling focus on the results over steps and ceding control to the language, more reusability on a finer level (higher-order functions etc.), few generic data structures with many operations -> better composability, “new” and different tools such as lazy collections, shaping the language towards the problem instead of vice versa, aligning with trends such as immutability)
  • Neal Ford: Java.next: The Java.next languages Leveraging Groovy, Scala, and Clojure in an increasingly polyglot world – a comparison of these languages with focus on what they are [not] suitable for, exploration of their paradigms (static vs. dynamic typing, imperative vs. functional)

SW development

  • How to Completely Fail at BDD – a story of an enthusiastic developer who tried to make everyone’s life better by introducing automated BDD tests and failed due to differences in culture (and inability to change thinking from the traditional testing), a surprising lack of interest in the tool and in learning how to write good tests: “Culturally, my current team just isn’t ready or interested in something like this.” Moral: It is hard to change people, good ideas are not enough.
  • M. Feathers: Refactoring is Sloppy – refactoring is often prioritized out of regular development and refactoring sprints/stories aren’t popular due to past failures etc. A counter-intuitive way to get refactoring in is to imagine, during planning, what the code would need to be like to make it easy to implement a story. Then create a task for making it so before the story itself and assign it to somebody other than whoever takes the story (to force a degree of scrutiny and communication). “Like anything else in process, this is medicine. It’s not meant to be ‘the way that people do things for all time’ [..]” – i.e. intended for use when you can’t fit refactoring in otherwise. It may also make the cost of the current bad code more visible. Read also the commits (f.ex. the mikado method case).
  • Cyber-dojo: A great way to practice TDD together. Compare your red-green cycle and development over time with other teams. Purposefully minimalistic editor, a number of prepared TDD tasks.
  • On the Dark Side of “Craftsmanship” – an interesting and provoking article. Some developers, the software labourers, want to get work done and go home; they haven’t the motivation and energy to continually spend time improving themselves. There is nothing wrong with that and we shouldn’t disparage them because of it. We shouldn’t divide people into craftsmen and the bad ones. A summary of and response to the varied reactions follows up in More on “Craftsmanship”. The author is right that we can’t expect everybody to spend nights improving her/his programming skills. Still they should not produce code of poor quality (with few exceptions) since maintaining such code costs a lot. There should be time for enough quality in a 9-5 day and people should be provided with enough guidance and education to be able to write decent code. (Though I’m not sure how feasible it is, how much effort it takes to become an acceptable developer.) Does the increased cost of writing (and learning to write) good code outweigh the cost of working with bad code? That is an eternal discussion.

Cloud, web, big data etc.

  • Whom the Gods Would Destroy, They First Give Real-time Analytics (via Leon) – a very reasonable argument against real-time analytics: yes, we want real-time operational metrics but “analytics” only makes sense on a sensible amount of data (for the sake of statistical significance etc.) RT analytics could easily provide misguided results.
  • CAP Twelve Years Later: How the “Rules” Have Changed (tl;dr, via @_dagi) – an in-depth discussion of the CAP theorem and the simplification (2 out of 3) that it makes; there are many more nuances. By Eric Brewer, a professor of computer science at the University of California, Berkeley, and vice president of infrastructure at Google.
  • ROCA: Resource-oriented Client Architecture – “A collection of simple recommendations for decent Web application frontends.” Server-side: true REST, no session state, working back/refresh etc. Client: semantic HTML independent of layout, progressive enhancement (usable with older browsers), usable without JS (all logic on the server) etc. Certainly not suitable for all types of apps but worthwhile to consider the principles and compare them with your needs.

Clojure Corner

Tools

  • Vaurien, the Chaos TCP Proxy (via @bsvingen) – an extensible proxy that you can control from your tests to simulate network failure or problems such as delays on 20% of the requests; great for testing how an application behaves when facing failures or difficulties with its dependencies. It supports the protocols tcp, http, redis, memcache.
  • Wvanbergen’s request-log-analyzer for Apache, MySQL, PostgreSQL, Rails and more (via Zarko) – generates a performance report from a supported access log to point out requests that might need optimizing
  • Working Effectively With iTerm2 (Mac) – good tips in the body and comments

Favorite Quotes

A very good (though not very scientific) definition of project success applicable for distinguishing truly agile from process-driven projects:

[..] a project is successful if:

  • Something was delivered and put to use
  • The project members, sponsors and users are basically happy with the outcome of the project

- Johannes Brodwall in “How do we become Agile?” and why it doesn’t matter, inspired by Alistair Cockburn

(Notice there isn’t a single word about being “on time and budget”.)

Posted in General, Java, Testing, Tools, Top links of month

Most interesting links of December ’12

Posted by Jakub Holý on December 31, 2012

Recommended Readings

Software development

  • Kent Beck: When Worse Is Better: Incrementally Escaping Local Maxima – Kent reintroduces his Sprinting Centipede strategy (“reduce the cost of each change as much as possible so as to enable small changes to be chained together nearly continuously” => “From the outside it is clear that big changes are happening, even though from the inside it’s clear that no individual change is large or risky.”) and advises on how to deal with situations where improvements have reached a local maximum by making the design temporarily intentionally worse (f.ex. by inlining all the ugly methods or writing to both the old and the new data store); strongly recommended
    • Related: Efficient Incremental Change – transmute risk into time by doing small, safe steps, then optimize your ability to make these steps quickly and thus being able to achieve large changes
  • Researchers: It is not profitable to outsource development – the Scandinavian research organisation SINTEF ICT has studied the effects of outsourcing and discovered that often it is more expensive than in-country development due to hidden costs caused by worse communication and cultural differences (f.ex. Indians tend not to ask questions and work based on their, often incomplete, understanding) and very high people turn-over; even after the true cost is discovered, companies irrationally stay there. However it is possible to succeed, in some cases.
  • Bjørn Borud: Tractor pulling and software engineering – very valuable and pragmatic advice on producing good software (i.e. avoiding accumulating so much crap that the software just stops progressing). Don’t think only about the happy path. Simplify. Write for other developers, i.e. avoid too “smart” solutions, test & document, do actually think about design and its implications w.r.t. performance etc. Awake the scientist in you: “Do things because you know they work, not because it happens to be the hip thing to do.”
    (Note: I see the good intention behind “design for the weakest programmer you can think of” but please don’t take it too far! Software should be primarily simple, not necessarily easy.)
  • Know your feedback loop – why and how to optimize it – to succeed, we need to learn faster; the only way to do that is to optimize our feedback loops, i.e. shorten the path our assumptions travel before they are (in)validated, whether about our code, business functionality, or the whole project idea. Concise, valuable.
  • Code quality is the least important reason to pair program – the author argues, based on his experience, other benefits of pair programming are more important than code quality: “[..]  the most important reasons why we pair: it contributes to an amazing company culture, it’s the best way to bring new developers up to speed, and it provides a great way to share knowledge across the development team.”
  • You Can’t Refactor Your Way Out of Every Problem – refactoring can’t help you if the design is fundamentally wrong, you need to rewrite it; know when it can or cannot help and act accordingly (related to how much design is needed upfront since some design decision cannot be reverted/improved upon)

Languages

  • Josh Bloch: Java – the good, bad and ugly parts (video, 15 min); summary: right design decisions (VM, GC, threads, dynamic linking, OOP, static typing, exceptions, …), some bad details (signed byte, lossy long -> double, == doesn’t call .equals, ability to call overridden methods from constructors, …); Mr. Bloch has also given a longer talk examining the evolution of Java from 1.0 to 1.7 in The Evolution of Java: Past, Present, and Future.
  • True Scala complexity – a thoughtful criticism of the complexity of Scala, based on code samples; “[it is true that] Scala is a language with a smaller number of orthogonal features than found in many other languages. [...] However, the problem is that each feature has substantial depth, intersecting in numerous ways that are riddled with nuances, limitations, exceptions and rules to remember. It doesn’t take long to bump into these edges, of which there are many.”; however, it’s possible to avoid many of the problems mentioned by resorting to less smart, more clumsy and verbose Java-like code; also, the author still likes Scala.
  • Scala or Java? Exploring myths and facts (3/2012) – a balanced view of Scala’s strengths and weaknesses; “[..] the same features that makes Scala expressive can also lead to performance problems and complexity. This article details where this balance needs to be considered.” Topics: productivity, complexity, concurrency support, language extensibility, Java interoperability, quality of tooling, speed, backward compatibility. Plenty of useful links.

Big data & Cloud:

  • Dean Wampler’s slides from Beyond Map Reduce – 1) Hadoop Map Reduce is the EJB 2 of big data but there are better APIs such as Cascading with Scala/Clojure wrappers; there are also “alternative” solutions like Spark and Storm; 2) functional/relational programming with simple data structures (lists, sets, maps etc.) is much more suitable for big data than OOP (for we do mostly stateless data transformations)
  • Apache HBase vs Apache Cassandra – comparison sheet – if you want to decide between the two
  • Optimizing MongoDB on AWS – 20 min talk about the current state of the art. Simplicity: Mongo AMIs by 10gen, Cloudformation template etc. Stability & perf.: new storage options – EBS with provisioned IOPS volumes (high I/O) + EBS Optimized Instances (dedicated throughput to EBS), High IO instances (hi1.4xlarge – SSD); comparison of throughput (number of operations, MBs) of these storages; tips for filesystem config. Scalability: scale horizontally and vertically, shrink as needed.
  • Getting Real About Distributed System Reliability by Jay Kreps, the author of the Voldemort DB: distributed software is NOT somehow innately reliable; a common mistake is to consider only the probability of independent failures but failures typically are dependent (e.g. network problems affect the whole data center, not a single machine); the theoretical reliability “[..] is an upper bound on reliability but one that you could never, never approach in practice”; “For example Google has a fantastic paper that gives empirical numbers on system failures in Bigtable and GFS and reports empirical data on groups of failures that show rates several orders of magnitude higher than the independence assumption would predict. This is what one of the best system and operations teams in the world can get: your numbers may be far worse.” The new systems are far less mature (=> more bugs, worse monitoring, less experience) and thus less reliable (it takes a decade for a FS to become mature, likely similar here). Distributed systems are of course more complex to configure and operate. “I have come around to the view that the real core difficulty of these systems is operations, not architecture or design.” Some nice examples of failures.
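A back-of-the-envelope Java sketch shows why the independence assumption in the Kreps article above is so optimistic. All numbers here are invented for illustration (they are not from the article or from Google’s paper): each of 3 replicas is assumed to be down 1% of the time, and a single shared cause (say, a data-center network problem) is assumed to take all replicas down together 0.1% of the time.

```java
final class DependentFailureSketch {

    /** Probability that all replicas are down at once if their failures are independent. */
    static double allDownIndependent(double p, int replicas) {
        return Math.pow(p, replicas);
    }

    /**
     * Same question, but with an additional common-cause event (probability c)
     * that takes every replica down together, e.g. a network partition.
     */
    static double allDownWithCommonCause(double p, int replicas, double c) {
        return c + (1.0 - c) * Math.pow(p, replicas);
    }

    public static void main(String[] args) {
        double p = 0.01;    // assumed per-replica failure probability
        int replicas = 3;   // assumed replication factor
        double c = 0.001;   // assumed probability of a shared, correlated failure

        double independent = allDownIndependent(p, replicas);
        double dependent = allDownWithCommonCause(p, replicas, c);

        System.out.printf("independent model:  %.3e%n", independent);
        System.out.printf("with common cause:  %.3e%n", dependent);
        System.out.printf("ratio:              %.0f%n", dependent / independent);
    }
}
```

With these made-up inputs the correlated failure dominates completely: losing all replicas becomes roughly a thousand times more likely than the independent model predicts, and adding more replicas barely helps.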

Other

  • Talks To Help You Become A Better Front-End Engineer In 2013 (tl;dr) – topics such as mobile web development, modern web devel. workflow, current/upcoming features of CSS3, ECMAScript 6, CSS preprocessors (LESS etc.), how to write maintainable JS, modular CSS, responsive design, JS debugging, offline webapps, CSS profiling and speed tracer, JS testing
  • On Being A Senior Engineer – valuable insights into what makes an engineer “senior” (i.e. mature; from the field of web operations but applies to IT in general): mature engineers seek out constructive criticism of their designs, understand the non-technical areas of how they are perceived (=> assertive, nice to work with etc.), understand that not all of their projects are filled with rockstar-on-stage work, anticipate the impact of their code (on resource usage, others’ ability to understand & extend it etc.), lift the skills and expertise of those around them, make their trade-offs explicit when making decisions, do not practice “Cover Your Ass Engineering,” are able to view the project from another person’s (stakeholder’s) perspective, are aware of cognitive biases (such as the Planning Fallacy), practice egoless programming, know the importance of (sometimes irrational) feelings people have.

Clojure Corner

  • Polymorphism in Clojure – Tim Ewald’s 1h live coding talk at Øredev conference introducing mechanisms for polymorphism (and Java interoperability) in Clojure and explaining well the different use cases for them. Included: why records, protocols & polymorphism with them (shapes, area => open, not explicit switch) (also good for Java interop.: interfaces), reify, multimethods.
  • Stuart Sierra: Thinking in Data (1h talk) – Sierra introduces data-oriented programming, i.e. programming with generic, immutable data structures (such as maps), pure functions, and isolated side-effects. Some other points: Records are an optimization, only for performance (rarely) or polymorphism (not often); the case for composable functions; testing using simulations (generative testing) etc.; visualization of state & process

Tools & Libs

  • Netflix’ Hystrix: a library that makes distributed systems more resilient by preventing a single slow/failing dependency from causing resource (thread etc.) exhaustion; it wraps external calls in a separate thread with clear timeouts and support for fallbacks, and offers good monitoring. Read “Problem Definition” on the page to understand the problem it tries to solve.
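The core idea (run the external call on a separate, bounded thread pool, wait only for a limited time, and fall back otherwise) can be sketched with plain java.util.concurrent. To be clear, this is not Hystrix’s API, just my minimal illustration of the technique; callWithFallback and its parameters are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutFallbackSketch {

    /**
     * Runs the call on the given pool and waits at most timeoutMillis for it.
     * On timeout or failure the fallback value is returned instead.
     */
    static <T> T callWithFallback(ExecutorService pool, Callable<T> call,
                                  long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the slow call, free the thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the dependency failed outright
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt status
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A small fixed pool caps how many threads one dependency can tie up.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            String fast = callWithFallback(pool, () -> "fresh", 500, "cached");
            String slow = callWithFallback(pool, () -> {
                Thread.sleep(2000);              // simulates a hanging dependency
                return "fresh";
            }, 100, "cached");
            System.out.println(fast + " / " + slow);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The real library adds much more on top of this (circuit breaking, metrics, per-dependency pools), so treat the sketch only as a way to understand the problem definition.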

Favorite Quotes

if you build something that is fundamentally broken it isn’t really interesting that you followed the plan or you followed some methodology — the thing you built is fundamentally broken.

- Bjørn Borud, Chief Architect at Comoyo.no, in an email 12/2012

The root of the Toyota Way is to be dissatisfied with the status quo; you have to ask constantly, “Why are we doing this?”

- Katsuaki Watanabe, Toyota President 2005 – 2009 (from the talk Deliberate Practice)

Posted in General, Java, Tools, Top links of month

Note: Loading Tab-Separated Data In Cascalog

Posted by Jakub Holý on October 9, 2012

To load all fields from a tab-separated text file in Cascalog we need to use the generic hfs-tap and specify the “scheme” (notice that loading all fields and expecting tab as the separator is the default behavior of TextDelimited):

 (hfs-tap
   (cascading.scheme.hadoop.TextDelimited.)
   "hdfs:///user/hive/warehouse/playerevents/epoch_week=2196/output_aewa-analytics-ada_1334697041_1.log")

With a custom separator and fields:

 (hfs-tap
   (cascading.scheme.hadoop.TextDelimited. (cascalog.workflow/fields ["?f1" "?f2"]) "\t") ; or cascading.tuple.Fields/ALL inst. of (fields ...)
   "hdfs:///user/hive/warehouse/playerevents/epoch_week=2196/output_aewa-analytics-ada_1334697041_1.log")

Hadoop doesn’t manage to load data files from nested sub-directories (for example from a Hive partitioned table). To load them, you need to use a “glob pattern” to turn the standard Hfs tap into a GlobHfs tap. This is how we would match all the subdirectories (Hadoop will then handle loading the files in them):

 (hfs-tap
   (cascading.scheme.hadoop.TextDelimited.)
   "hdfs:///user/hive/warehouse/playerevents/"
   :source-pattern "epoch_week=*/")

Enjoy.

Posted in General, Java, Tools

Enabling JMX Monitoring for Hadoop And Hive

Posted by Jakub Holý on September 21, 2012

Hadoop’s NameNode and JobTracker expose interesting metrics and statistics over JMX. Hive seems not to expose anything interesting but it still might be useful to monitor its JVM or do simpler profiling/sampling on it. Let’s see how to enable JMX and how to access it securely, over SSH.

Read the rest of this entry »

Posted in Tools

How to Add MapRed-Only Node to Hadoop

Posted by Jakub Holý on August 8, 2012

I was surprised not to be able to google an answer to this so I want to record my findings here. To add (a.k.a. commission) a node to a Hadoop cluster that should be used only for map-reduce tasks and not for storing data, you have multiple options:

  1. Do not start the datanode service on the node
  2. If you’ve configured Hadoop to allow only nodes listed in its whitelist files to connect to it, then add the node to the file pointed to by the property mapred.hosts but not to the file pointed to by dfs.hosts.
  3. Otherwise add the node to the DFS’ blacklist, i.e. the file pointed to by the property dfs.hosts.exclude, and execute hadoop dfsadmin -refreshNodes on the namenode to apply it.

Read the rest of this entry »

Posted in General, Tools

Most interesting links of April ’12

Posted by Jakub Holý on April 30, 2012

Recommended Readings

  • V. Duarte: Story Points Considered Harmful – Or why the future of estimation is really in our past… (also as 1h video) – a thoughtful and data-backed claim that there is a much cheaper way of estimating work throughput than estimating each story in story points (SP), and that is simply counting the stories. Even though their sizes differ, over (not that much) longer periods, where it really matters, these differences will even out. The author argues that estimating in number of stories provides the same reliability and benefits as SP and is much easier. (Keep in mind that estimation is just an attempt at predicting the future and humans have proved to be terrible at that; why pretend that we can do it?) I’d recommend this to anybody doing Scrum or similar.
  • M. Fowler: Test Coverage – it’s obvious that increasing test coverage for the sake of test coverage is nonsense but some people still need to be reminded of it :-) . Fowler explains what the real benefit of test coverage measurements is and how to use it for good instead of for evil.
  • Brian Marick: How to Misuse Code Coverage (pdf) – cited a lot by Fowler in his article, this is really a good paper. Marick has participated in the development of several code coverage tools and understands their limitations well. One of the key points is that code coverage tools can discover only one class of test weakness (not testing some paths through your code) but cannot discover that you are missing some code you should have (e.g. when you check only for two of three possible return values). Thus the code coverage metric tells you “this code isn’t well tested, are you sure you don’t want to look more into it?” It’s crucial not to write tests just to increase the code coverage; look at the code and improve the tests without any regard for coverage. You may thus decrease the likelihood of both classes of problems.
  • A Year with MongoDB – Kiip has found out that Mongo isn’t the best choice for them (having 240GB, 500+ operations/s, 85M docs and their specific usage of the store) and migrated to the combination of Riak (key-value store) and PostgreSQL. Some of the issues they hit are slow counts and limit/offset queries due to using non-counting B-trees for indexing, memory management that could be more intelligent and tuned for the use to make sure the data needed is indeed in RAM, no built-in support for compressing key names (their size adds up as they’re repeated in each document; you have to compress them [user -> u etc.] in the client if you want to), limited concurrency due to a process-wide write lock (which becomes a problem if the writes aren’t short enough w.r.t. the number of ops/s, e.g. because data isn’t in RAM and/or the query is complicated), safe settings (waiting for a write to finish, …) off by default, offline-only table compaction (w/o it the disk usage grows unbounded). The lessons learnt for me: Know your storage, its weaknesses and intended way of usage, and make sure it matches your needs.
  • Rudolf Winestock: The Lisp Curse – Lisp’s expressive power is actually a cause of its lack of momentum because it’s so easy to implement anything that people have no need to join forces and thus there are many half-baked (“works-for-me”) solutions for anything – but no complete, generally accepted one. An interesting essay. “Lisp is so powerful that problems which are technical issues in other programming languages are social issues in Lisp.”
  • Understanding JDBC Internals & Timeout Configuration – the article itself could have been written better but it conveys the important information that configuring timeouts for JDBC isn’t trivial because they need to be set correctly at different levels and without a socket timeout set in a driver-specific way it can hang forever if the DB cannot be reached due to network/system failure
  • Circos: An Amazing Tool for Visualizing Big Data – this article is interesting primarily for its combination of Google Analytics API, Neo4J and an unusual data visualization with circular graphs
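Marick’s point from the list above, that a coverage tool cannot see code you forgot to write, is easy to demonstrate with a tiny made-up Java example (the Status enum and describe method are mine, not from the paper). The method handles only two of the three possible values, yet two trivial checks execute every line and every branch of it, so a coverage report would show 100%.

```java
final class CoverageBlindSpotSketch {

    enum Status { OK, RETRY, FAILED }

    /** Buggy on purpose: the FAILED case was forgotten, so it is silently reported as "done". */
    static String describe(Status status) {
        if (status == Status.RETRY) {
            return "retrying";
        }
        return "done";
    }

    public static void main(String[] args) {
        // These two checks together cover every line and both branches of describe().
        System.out.println(describe(Status.OK).equals("done"));
        System.out.println(describe(Status.RETRY).equals("retrying"));

        // Full coverage, and still the missing case slips through:
        System.out.println("FAILED is described as: " + describe(Status.FAILED));
    }
}
```

Only looking at the code and asking “what should happen for FAILED?” reveals the gap; the coverage number never will.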

Tools

  • CRaSH: Extensible shell for the JVM (docs) – a shell that you can embed into a web server as a WAR, run standalone or attach to a running JVM, connect to it via SSH or Telnet, and use it to execute commands against the JVM. Some commands: configure loggers, control threads, monitor the system (mem, threads, ..), connect/issue queries via JDBC. More commands can be written in Groovy. There is a whole set of commands for working with JCR. Pluggable authentication.

Clojure Corner

Posted in Databases/DB2, General, Testing, Tools, Top links of month