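Original post: http://www.zhizhihu.com/html/y2012/3922.html
Big Data Linkshare is a mailing list I subscribe to, though I have forgotten how I signed up.
The content is quite good, though, and this issue is especially good.
Copyright notice: Copyright © 2012 Israeli Big Data Linkshare, All rights reserved.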
The past few weeks have been too hectic to find time for writing the newsletter. I hope that going forward I'll be able to keep to a regular schedule. However, I'm travelling next week, so the next issue will only come out in the last week of September. But if any of you are in London and want to have a beer and discuss data science, drop me a line!
Time Series and Multivariate Analysis in R [R, Statistics]
Book (75 pp.): http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/index.html
A series of three short, free books about data analysis in R. The first discusses time series analysis (decomposition into trend and seasonal components, exponential smoothing, ARIMA and the like).
The second is about multivariate analysis and covers PCA, LDA and other topics.
The last one is on biomedical statistics, so it may be somewhat specialized. It is linked to from the other pages.
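As a rough illustration of the exponential smoothing covered in the first book, here is a minimal sketch of simple exponential smoothing. It is written in Python rather than R to match the other examples in this issue, and the function name and `alpha` parameter are illustrative choices, not taken from the book.

```python
def simple_exponential_smoothing(series, alpha):
    """Smooth a series with s[0] = x[0] and
    s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [float(series[0])]
    for x in series[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    print(simple_exponential_smoothing([2, 4, 6], alpha=0.5))
```

A larger `alpha` tracks recent observations more closely; a smaller one gives a smoother curve.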
Mining Twitter with MongoDB and MapReduce [Big Data]
Demonstration of how to use a Ruby Twitter gem to scrape tweets into MongoDB, and then run a Mongo map-reduce task (non-Hadoop) to get a histogram of tweets per hour of day.
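The tweets-per-hour histogram can be sketched in plain Python. The post runs the job inside MongoDB from Ruby; the sketch below only imitates the map and reduce phases in memory, and the `created_at` field name and its timestamp format are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def map_tweet(tweet):
    """Map phase: emit (hour_of_day, 1) for one tweet."""
    # Assumed field name and timestamp format, for illustration only.
    created = datetime.strptime(tweet["created_at"], "%Y-%m-%d %H:%M:%S")
    yield created.hour, 1


def reduce_counts(hour, counts):
    """Reduce phase: sum all the ones emitted for a given hour."""
    return hour, sum(counts)


def tweets_per_hour(tweets):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for tweet in tweets:
        for hour, one in map_tweet(tweet):
            grouped[hour].append(one)
    return dict(reduce_counts(h, c) for h, c in sorted(grouped.items()))


if __name__ == "__main__":
    sample = [
        {"created_at": "2012-09-01 09:15:00"},
        {"created_at": "2012-09-02 09:45:10"},
        {"created_at": "2012-09-02 23:01:00"},
    ]
    print(tweets_per_hour(sample))
```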
Optimizing Pig Jobs [Big Data]
This guy did his PhD project on running database benchmarks against Hive and Pig and working out how to optimize them. He has a few tips stashed away in a PowerPoint deck attached to a JIRA issue linked from the post. They are:
- Reorder JOINs properly
- Use COGROUP for JOIN + GROUP
- Use FLATTEN for self-join
- Project before (CO)GROUP
- Remove types in LOAD
- Use hash-based aggregation
The post itself goes into detail about the benchmarks they ran and why different improvements, such as cutting down on the MapReduce jobs Pig generates (e.g., Pig generates 3 jobs for an ORDER BY clause) or enabling map-side aggregation, are significant.
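To see why hash-based aggregation on the map side can matter, here is a small Python sketch (not Pig code, and not taken from the post) comparing how many records a mapper emits with and without aggregating in a hash table first. The function names are hypothetical.

```python
from collections import Counter


def map_without_aggregation(keys):
    """Emit one (key, 1) record per input record."""
    return [(key, 1) for key in keys]


def map_with_hash_aggregation(keys):
    """Pre-aggregate counts in a hash table inside the mapper,
    then emit one (key, partial_count) record per distinct key."""
    partial = Counter()
    for key in keys:
        partial[key] += 1
    return list(partial.items())


if __name__ == "__main__":
    keys = ["a", "b", "a", "a"]
    print(len(map_without_aggregation(keys)))       # records sent to reducers
    print(sorted(map_with_hash_aggregation(keys)))  # fewer, pre-summed records
```

Fewer emitted records means less data to sort and shuffle to the reducers, which is presumably where much of the saving comes from.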
Unique Python Data Structures [Python]
This short blog post covers several Python libraries implementing Bloom filters, tries, efficient lists or arrays, and general-purpose graphs. Some of these are especially well suited to big data or NLP, so I opted to include this post.
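For readers who have not met a Bloom filter before, here is a minimal pure-Python sketch of the idea. It is not one of the libraries from the post; the class, its parameters and the choice of SHA-256 as the hash are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Set-like structure that may report false positives
    but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # num_hashes must stay below 256 because the seed is one byte.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions from seeded SHA-256 digests."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("hadoop")
    print("hadoop" in seen)
```

The memory cost is fixed by `num_bits` regardless of how many items are added; the price is a false-positive rate that grows as the filter fills up.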
Scaling Deep Learning by Google [ML, Big Data]
Jeff Dean of Google fame (his work has been reviewed here before) gives a technical talk, from an engineering standpoint, on how Google runs fast, parallel optimization algorithms (such as SGD or L-BFGS).
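As a reminder of what the basic SGD update looks like, here is a single-machine Python sketch that fits a one-parameter linear model. It says nothing about how the talk parallelizes the algorithm; the function name, learning rate and toy data are assumptions chosen for illustration.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.1, epochs=20, seed=0):
    """Fit y ~ w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a random order
        for i in order:
            error = w * xs[i] - ys[i]
            # Gradient of 0.5 * error**2 with respect to w is error * x.
            w -= learning_rate * error * xs[i]
    return w


if __name__ == "__main__":
    print(sgd_linear_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```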
Cardinality Estimation Algorithms [Big Data, Python]
Continuing the ideas from earlier blog posts featured here on sketches and other probabilistic counting methods, this post covers and demonstrates (in Python) SuperLogLog and HyperLogLog.
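To give a feel for HyperLogLog, here is a compact Python sketch. It is not the code from the post: the class layout, the use of SHA-1 as a 64-bit hash source and the default of 4096 registers are assumptions, and it includes only the standard small-range (linear counting) correction.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this formula is meant for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        index = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(10000):
        hll.add("item-%d" % i)
    print(round(hll.count()))
```

Adding the same item twice never changes a register, which is why the estimate depends only on the set of distinct items.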
Re-Expression optimizes your ML models [ML]
This site includes a very abstract description and reference implementation, in C++, of the re-expression method, whereby Naive Bayes model variables are re-encoded into progressively more complex combinations to better fit the data, while avoiding the creation of dependent variables.
Recommendation engine in Hadoop with Python [ML, Big Data, Python]
A very detailed walkthrough of how to write Python MapReduce jobs with mrjob to do collaborative filtering. The algorithm implemented is item-item similarity computed from user-item preference data. Various measures are demonstrated (cosine similarity, Jaccard distance and correlation coefficient).
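The item-item cosine similarity step can be sketched in a single process as follows. The walkthrough itself uses mrjob on Hadoop, which is not reproduced here; the function name and the `(user, item, rating)` tuple layout are assumptions, and missing ratings are treated as zeros.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_cosine_similarities(ratings):
    """Return {(item_a, item_b): cosine similarity} for every pair of
    items rated together by at least one user."""
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating

    dots = defaultdict(float)    # dot products per item pair
    norms = defaultdict(float)   # squared vector norm per item
    for items in by_user.values():
        for item, rating in items.items():
            norms[item] += rating * rating
        # Each user contributes to every pair of items they rated.
        for a, b in combinations(sorted(items), 2):
            dots[(a, b)] += items[a] * items[b]

    return {pair: dot / math.sqrt(norms[pair[0]] * norms[pair[1]])
            for pair, dot in dots.items()}


if __name__ == "__main__":
    data = [("u1", "A", 1), ("u1", "B", 1),
            ("u2", "A", 1), ("u2", "B", 1), ("u2", "C", 1)]
    for pair, sim in sorted(item_cosine_similarities(data).items()):
        print(pair, round(sim, 4))
```

In a MapReduce setting the per-user pair generation would presumably be the map step and the summation per item pair the reduce step.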
Fast Bayesian Inference with STAN [Statistics]
A lot of excitement has followed the release of Stan, a BUGS alternative for running Monte Carlo simulations. Under the hood, Stan generates and compiles C++ code from a BUGS-like model description, and the compiled program runs very fast Hamiltonian MCMC to converge to the model distribution. It comes with R bindings, so it is easy to use from your research code.
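For intuition about Hamiltonian MCMC, here is a toy Python sketch that samples a one-dimensional distribution with a fixed step size and leapfrog integration. It is plain textbook HMC and not a description of what Stan does internally; the function name, step size and number of leapfrog steps are assumptions.

```python
import math
import random


def hmc_sample(log_prob, grad_log_prob, q0, n_samples,
               step=0.2, n_leapfrog=10, rng=None):
    """Draw n_samples from the density proportional to exp(log_prob(q))."""
    rng = rng or random.Random(0)
    q = q0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)            # resample the momentum
        q_new, p_new = q, p
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new += 0.5 * step * grad_log_prob(q_new)
        for i in range(n_leapfrog):
            q_new += step * p_new
            if i < n_leapfrog - 1:
                p_new += step * grad_log_prob(q_new)
        p_new += 0.5 * step * grad_log_prob(q_new)
        # Metropolis accept/reject on the change in total energy.
        h_old = -log_prob(q) + 0.5 * p * p
        h_new = -log_prob(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


if __name__ == "__main__":
    # Target: standard normal, log density -q**2 / 2 up to a constant.
    draws = hmc_sample(lambda q: -0.5 * q * q, lambda q: -q, 0.0, 2000)
    print(sum(draws) / len(draws))
```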
Survey of NLP papers [NLP]
To conclude, I'll be lazy and link to someone else's summaries: Alexandre summarizes the papers he found interesting from the EMNLP '12 conference. As mentioned, I've had almost no time in the past month to review anything longer than a page.