Wikimedia architecture Ryan Lane Wikimedia Foundation Inc.

Presentation transcript:


Topics: Intro; Our technical operations; Global architecture; Application servers; Storage; Caching; Load balancing; Content Delivery Network (CDN); The site architecture you can edit; Community involvement

Top five worldwide sites

Our operations: Currently managed by ~6 ops engineers; Historically ad-hoc, “fire fighting mode”; Technical staff spread out globally; Always someone awake... but no on-call; Working to engage community operations contributors

Operations communication: Most communication public on IRC; Documentation in a wiki; Sensitive communication via private lists and resource trackers

Architecture: LAMP...

...on steroids.

The Wiki software: All Wikimedia projects run on the MediaWiki platform; Designed primarily for Wikimedia sites; Open Source PHP software (GPL); Very scalable, very good localization

MediaWiki optimization: We optimize by caching expensive operations and by focusing on the hot spots in the code (profiling!); If a MediaWiki feature is too expensive, it doesn't get enabled

MediaWiki caching: Caches everywhere; Most of this data is cached in Memcached, a distributed object cache
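The "cache expensive operations" idea can be sketched as a cache-aside lookup. This is a minimal illustration, not MediaWiki's code: `ObjectCache` is a hypothetical in-process stand-in for a distributed cache such as Memcached, and the key name and the 300-second TTL are assumptions.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ObjectCache:
    """In-process stand-in for a distributed object cache (hypothetical)."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired entries count as misses
            return None
        return value

    def set(self, key: str, value: str, ttl: float) -> None:
        self._store[key] = (time.monotonic() + ttl, value)


def get_with_cache(cache: ObjectCache, key: str,
                   compute: Callable[[], str], ttl: float = 300.0) -> str:
    """Return the cached value, computing and storing it on a miss."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, ttl)
    return value


render_calls: List[str] = []


def render_main_page() -> str:
    render_calls.append("render")  # track how often the expensive path runs
    return "<p>rendered page</p>"


cache = ObjectCache()
first = get_with_cache(cache, "enwiki:rendered:Main_Page", render_main_page)
second = get_with_cache(cache, "enwiki:rendered:Main_Page", render_main_page)
```

The second lookup is served from the cache, so the expensive render function runs only once.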

MediaWiki profiling
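A rough sketch of what section-level profiling can look like: wrap named sections of a request, accumulate wall-clock time per section, and list the most expensive ones first. `SectionProfiler` and the section names are hypothetical and are not MediaWiki's profiler.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List, Tuple


class SectionProfiler:
    """Accumulates call counts and wall-clock time per named section."""

    def __init__(self) -> None:
        self.calls: DefaultDict[str, int] = defaultdict(int)
        self.seconds: DefaultDict[str, float] = defaultdict(float)

    @contextmanager
    def section(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the section even if the wrapped code raised.
            self.calls[name] += 1
            self.seconds[name] += time.perf_counter() - start

    def hot_spots(self) -> List[Tuple[str, float]]:
        """Sections sorted by total time, most expensive first."""
        return sorted(self.seconds.items(), key=lambda kv: kv[1], reverse=True)


profiler = SectionProfiler()
for _ in range(3):
    with profiler.section("parse"):
        sum(i * i for i in range(20_000))  # stand-in for expensive work
    with profiler.section("skin"):
        sum(range(100))  # stand-in for cheap work
```

Calling `profiler.hot_spots()` then gives the sections worth optimizing or caching first.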

Core databases: One master, many replicated slaves; Load balanced reads to slaves, write operations to master; Separate database per wiki; Separate big, popular wikis from smaller wikis (sharding)
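The read/write split and the per-wiki sharding can be sketched as a small routing function. Everything named here is an assumption for illustration: the cluster names, host names, the wiki-to-cluster map, and the simple "starts with SELECT" test for reads.

```python
import itertools
from typing import Dict, Iterator, List, Tuple


class Cluster:
    """One master plus its read replicas (host names are hypothetical)."""

    def __init__(self, master: str, replicas: List[str]) -> None:
        self.master = master
        # Fall back to the master if a cluster has no replicas.
        self._readers: Iterator[str] = itertools.cycle(replicas or [master])

    def host_for(self, sql: str) -> str:
        """Send SELECTs to replicas in round-robin order, the rest to the master."""
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._readers)
        return self.master


# Big, popular wikis get their own cluster; everything else shares one.
CLUSTERS: Dict[str, Cluster] = {
    "big": Cluster("db-big-master", ["db-big-r1", "db-big-r2"]),
    "small": Cluster("db-small-master", ["db-small-r1"]),
}
WIKI_TO_CLUSTER = {"enwiki": "big"}
DEFAULT_CLUSTER = "small"


def route(wiki: str, sql: str) -> Tuple[str, str]:
    """Return (host, database name); each wiki has its own database."""
    cluster = CLUSTERS[WIKI_TO_CLUSTER.get(wiki, DEFAULT_CLUSTER)]
    return cluster.host_for(sql), wiki
```

With this sketch, `route("enwiki", "SELECT ...")` alternates between the two replicas of the big cluster, while any write for `enwiki` goes to `db-big-master`.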

Media storage: Current solution is not scalable; Replacing with an open source distributed file system; Likely OpenStack Swift

Thumbnail generation: stat() on each request is too expensive, so assume every file exists; If a thumbnail doesn’t exist, ask the application servers to render it
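The "assume every file exists" approach amounts to an optimistic read with a render-on-miss fallback. A minimal sketch, assuming a dict as a stand-in for the media store and a hypothetical `render` callback in place of the application servers; storing the rendered result for later requests is also an assumption.

```python
from typing import Callable, Dict, List


def serve_thumbnail(path: str, storage: Dict[str, bytes],
                    render: Callable[[str], bytes]) -> bytes:
    """Read the thumbnail directly; render it only when the read fails."""
    try:
        return storage[path]  # optimistic read, no existence check first
    except KeyError:
        thumb = render(path)   # slow path: ask the renderer
        storage[path] = thumb  # assumed: keep the result for later requests
        return thumb


rendered: List[str] = []


def fake_render(path: str) -> bytes:
    rendered.append(path)  # record each trip down the slow path
    return b"thumb:" + path.encode("utf-8")


store: Dict[str, bytes] = {}
a = serve_thumbnail("thumb/a/ab/Example.jpg/120px-Example.jpg", store, fake_render)
b = serve_thumbnail("thumb/a/ab/Example.jpg/120px-Example.jpg", store, fake_render)
```

The common case (thumbnail already exists) costs one read and no extra existence check; only the first request for a given size pays for rendering.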

Squid caching: Caching reverse HTTP proxy; Serves most of our traffic; Split into two groups (Text and Media); Hit rates: 95% for Text, 98% for Media, since the use of CARP

CARP
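CARP (Cache Array Routing Protocol) sends each URL to one member of a cache array by hashing, so a given object tends to live on exactly one backend cache instead of being duplicated across all of them. The sketch below shows the general idea with highest-score hashing. It is not the real CARP hash function and it ignores per-member load factors; SHA-256 and the cache names are stand-ins chosen for illustration.

```python
import hashlib
from typing import List


def score(cache: str, url: str) -> int:
    """Combine cache name and URL into a deterministic score."""
    digest = hashlib.sha256(f"{cache}|{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_cache(caches: List[str], url: str) -> str:
    """Route the URL to the cache with the highest combined score."""
    return max(caches, key=lambda cache: score(cache, url))


caches = ["sq1", "sq2", "sq3", "sq4"]  # hypothetical cache host names
url = "http://en.wikipedia.org/wiki/Main_Page"
owner = pick_cache(caches, url)
```

A useful property of this scheme: removing a cache other than the owner does not change where the URL is routed, so only the objects held by the removed cache need to be refetched.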

Squid cache invalidation: Wiki pages are edited at an unpredictable rate; Users should always see current revision; Invalidation through expiry times not acceptable; Purge implemented using multicast UDP-based HTCP
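The difference between expiry-based and explicit invalidation can be sketched without any networking. In this illustration an edit triggers a purge of the affected URLs on every cache; the `EdgeCache` class is hypothetical, and the real system delivers the purge as HTCP messages over multicast UDP, which this sketch does not attempt to reproduce.

```python
from typing import Dict, Iterable, List


class EdgeCache:
    """Stand-in for one caching proxy (hypothetical)."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}

    def purge(self, url: str) -> None:
        self.objects.pop(url, None)  # purging an absent URL is a no-op


def purge_everywhere(caches: Iterable[EdgeCache], urls: List[str]) -> None:
    """Drop the given URLs from every cache right after an edit."""
    for cache in caches:
        for url in urls:
            cache.purge(url)


page = "http://en.wikipedia.org/wiki/Example"
other = "http://en.wikipedia.org/wiki/Unrelated"
edge_caches = [EdgeCache(), EdgeCache()]
for edge in edge_caches:
    edge.objects[page] = "old revision"
    edge.objects[other] = "still current"

purge_everywhere(edge_caches, [page])  # the page was just edited
```

After the purge, the next request for the edited page misses in every cache and fetches the current revision, while unrelated cached pages are untouched.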

Varnish caching: 2-3 times more efficient than Squid; Will eventually replace the Squid architecture; Currently used for delivering static content such as JavaScript and CSS files (bits)

Load Balancing with LVS-DR (Linux Virtual Server, Direct Routing mode): All real servers share the same IP address; The load balancer divides incoming traffic over the real servers; Return traffic goes directly to the client, bypassing the load balancer!

Content Distribution Network (CDN): 2 clusters on 2 different continents (primary cluster in Tampa, Florida; secondary caching-only cluster in Amsterdam); Adding a new primary datacenter in Virginia

Geographic Load Balancing: Most users use a DNS resolver close to them; Map IP address of resolver to a country code; Deliver CNAME of close datacenter entry based on country; Using PowerDNS with a Geobackend
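The resolver-to-datacenter mapping can be sketched as two lookups: resolver IP to country, then country to CNAME. The networks below are documentation-only address ranges, and the country assignments and host names are invented for illustration; this is not PowerDNS configuration.

```python
import ipaddress
from typing import Dict, List, Tuple

# Hypothetical resolver networks mapped to country codes.
RESOLVER_NETWORKS: List[Tuple[ipaddress.IPv4Network, str]] = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "NL"),
]

# Hypothetical per-country targets, one per datacenter.
COUNTRY_TO_CNAME: Dict[str, str] = {
    "US": "text.tampa.example.org.",
    "NL": "text.amsterdam.example.org.",
}
DEFAULT_CNAME = "text.tampa.example.org."


def cname_for_resolver(resolver_ip: str) -> str:
    """Pick a datacenter CNAME from the DNS resolver's IP address."""
    address = ipaddress.ip_address(resolver_ip)
    for network, country in RESOLVER_NETWORKS:
        if address in network:
            return COUNTRY_TO_CNAME.get(country, DEFAULT_CNAME)
    return DEFAULT_CNAME  # unknown resolvers go to the primary cluster
```

The lookup keys on the resolver's address rather than the end user's, which works because most users sit close to their resolver.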

The Site Architecture You Can Edit: Varnish; Mobile?; Membase?; Swift

Test/Dev Architecture

Basic use case: Ops makes initial default project; Clone of production cluster; Used for most test/dev; New projects mirror community or foundation initiatives; Devs build architecture in new project; Devs request merge for Puppet changes via Gerrit; Project instances moved to default project and tested; Project moved to production cluster

How to engage the community: Discuss; Commit; Participate

How to engage the community: Document; Communicate changes

Our philosophy: Engage early; Release early, release often; Scratch your own itch

Coding for WMF: Security Security is important. Really. People rely on developers to write secure code, so: An insecure extension in SVN... An insecure extension on Wikipedia...

Common vulnerabilities to avoid: SQL injection; Cross site scripting (XSS); Cross site request forgery (CSRF); Register Globals
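A short sketch of the usual defenses against the first three items, using only the Python standard library: parameterized queries against SQL injection, output escaping against XSS, and a per-session token against CSRF. The table, function names, and session dict are hypothetical; MediaWiki itself is PHP and has its own helpers for each of these.

```python
import hmac
import html
import secrets
import sqlite3
from typing import Dict, List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (title TEXT, body TEXT)")
conn.execute("INSERT INTO page VALUES (?, ?)", ("Main_Page", "hello"))


def find_page(db: sqlite3.Connection, title: str) -> List[Tuple[str, str]]:
    """SQL injection: pass user input as a bound parameter, never by string concatenation."""
    return db.execute(
        "SELECT title, body FROM page WHERE title = ?", (title,)
    ).fetchall()


def render_title(title: str) -> str:
    """XSS: escape user-controlled text before placing it in HTML."""
    return "<h1>" + html.escape(title) + "</h1>"


def issue_token(session: Dict[str, str]) -> str:
    """CSRF: tie an unguessable token to the user's session."""
    session["csrf"] = secrets.token_hex(16)
    return session["csrf"]


def check_token(session: Dict[str, str], submitted: str) -> bool:
    """Accept a state-changing request only if the submitted token matches."""
    expected = session.get("csrf", "")
    return bool(expected) and hmac.compare_digest(expected, submitted)
```

With the bound parameter, an input such as `x' OR '1'='1` is treated as a literal title and matches nothing.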

General notes on security: Don't trust anyone; Sanitize all input; Write code that is demonstrably secure; Best of all: try to break and hack your own code

Coding for WMF: Scalability and performance. Wikimedia sites are huge (5th most visited web presence); Code must be performant and scalable

Coding for WMF: Scalability and performance Cache Profile Optimize Ask for advice!

Coding for WMF: Concurrency Assume a clustered architecture, always Your code will run concurrently It can result in strange bugs
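One common way to cope with code running concurrently on many servers is to let an atomic "add if absent" operation on a shared store decide which worker does a piece of work. The sketch below imitates that with threads. `SharedCache` is a hypothetical stand-in for a shared cache, and its internal lock only emulates the atomicity a real cache server would provide.

```python
import threading
from typing import Dict, List


class SharedCache:
    """Stand-in for a shared cache with an atomic add-if-absent operation."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self._guard = threading.Lock()  # emulates server-side atomicity

    def add(self, key: str, value: int) -> bool:
        """Store the value only if the key is absent; report whether we stored it."""
        with self._guard:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def try_regenerate(cache: SharedCache, key: str, worker_id: int,
                   winners: List[int]) -> None:
    """Only the worker that wins the add() does the expensive work."""
    if cache.add("lock:" + key, worker_id):
        winners.append(worker_id)  # stand-in for regenerating the value


shared = SharedCache()
winners: List[int] = []
threads = [
    threading.Thread(target=try_regenerate, args=(shared, "sidebar", i, winners))
    for i in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

However the eight workers are scheduled, exactly one of them ends up in `winners`. A check-then-set done in two separate steps would not give that guarantee.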

Closing notes: We rely heavily on open source; Always looking for efficiencies; Looking for more efficient management tools; Looking for more contributors

Questions, comments? Ryan Lane IRC: Freenode, Nick: Ryan_Lane, Channels: #wikimedia-tech, #wikimedia-operations, #mediawiki, #openstack

Communication resources: Mailing lists, important lists: mediawiki-l (a MediaWiki support list), wikitech-l (a MediaWiki developers' list), mediawiki-api (a MediaWiki developers' list for the API); IRC channels (on freenode): #mediawiki (a MediaWiki support channel), #wikimedia-dev (a MediaWiki developers' channel)

Developer resources - developer hub: Developer hub (lists resources, guidelines, and code documentation); How to become a MediaWiki hacker (introduction to MediaWiki development); Security for developers (essential security documentation)

Developer resources: Coding conventions (conventions required for all Wikimedia-run software); Localisation (resources for writing code that can be easily localised); Code review guide (how your code will be reviewed before inclusion)